Bug 959281 (Closed) - Opened 10 years ago, Closed 10 years ago

http://paperjs.org/examples/voronoi/ drawing performance has regressed badly

Categories

(Core :: General, defect)

Platform: All
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla31
Tracking Status:
firefox29: - (wontfix)
firefox30: + (fixed)
firefox31: --- (fixed)

People

(Reporter: mayankleoboy1, Assigned: smichaud)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, regression, Whiteboard: [STR in comment #19][description in comment #98 and comment #100])

Attachments

(6 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0 (Beta/Release)
Build ID: 20140113030203

Steps to reproduce:

1. Start a new profile
2. Go to http://paperjs.org/examples/voronoi/
3. Move the mouse around


Actual results:

extremely slow animation/JS


Expected results:

lightning fast, just like a few months/weeks back.

This demo used to be butter smooth some time back.

I don't know how real-world this demo is, so I can't comment on its importance.


Tentatively marking this in the JS engine component.
Would you be willing to bisect the regression? I see that you set the version to 23, so I assume that's the first version you see the regression in?
Flags: needinfo?(mayankleoboy1)
(In reply to Till Schneidereit [:till] from comment #1)
> I see that you set the
> version to 23, so I assume that's the first version you see the regression
> in?

I mistakenly marked the version as 23. I don't know where this regressed.
The surprising thing is that if you keep on moving the mouse around, and switch tabs a few times, the slowness disappears.

I think a better STR is:

1. Load the given URL.
2. Load a few tabs.
3. Close Nightly.
4. Restart Nightly.
5. Do a session restore.
6. Go to the tab where this URL is opened, and move the mouse around.


> Would you be willing to bisect the regression?

I am really sorry, but I don't have time to bisect the regression :(
Flags: needinfo?(mayankleoboy1)
Version: 23 Branch → Trunk
I bisected this bug and I believe this is a regression in Nightly build 2014-01-09:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=711f08b0aa1d&tochange=37516445a0b5
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 7 → All
Hardware: x86_64 → All
There are graphics changes in there too... a profile shows no obvious hotspots for me other than FillRect(), but even that is at 20% or so.  At least on Mac.
I don't see any JS changes in that range that are likely regressors.

There's bug 938385, but I doubt this website hits that size limit and even then, it would be very surprising if it hurts performance like this. And there's my patch for bug 957628, but that's also a pretty simple change that just avoids some slow calls in some cases. And some other trivial patches.
I bisected mozilla-inbound builds and I believe this is the regression pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=8342bfb3ef52&tochange=4a3e51bf999b

kats: Could your APZC changes (bug 907179) in this range affect this drawing demo's performance? I'm using an MBP with a Retina display.
Flags: needinfo?(bugmail.mozilla)
The APZC changes shouldn't affect Firefox on Mac, since APZC isn't used there. It's a pretty narrow regression window though, so it's probably worth trying a build with that backed out.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> The APZC changes shouldn't affect Firefox on Mac, since APZC isn't used
> there. It's a pretty narrow regression window though, so it's probably worth
> trying a build with that backed out.

Will you be backing it out so we can see if that fixes this?
I can't build for mac locally right now but I pushed to try:

https://tbpl.mozilla.org/?tree=Try&rev=5b38e2bb9e68
Can you test to see if the backout has addressed this?
Flags: needinfo?(mayankleoboy1)
Could you point me to a Win7 build?
Flags: needinfo?(mayankleoboy1)
With the latest nightly, I can still reproduce.

Better STR:

1. create a new profile
2. open google.com in a tab
3. Open the given URL in a new tab, and switch to it.
4. While in the Voronoi tab, right-click on the google.com tab and select "Reload Tab".
5. While the google.com tab is reloading, move the mouse inside the benchmark.  It would help if you have a slow network.

While the google.com tab reloads, the voronoi demo will lag quite a lot.
(In reply to Benjamin Kerensa [:bkerensa] from comment #10)
> Can you test to see if the backout has addressed this?

Nothing has been backed out from mozilla-central, so Nightly should still have the same perf problems. In comment 9, kats was just pushing to the Try server to get an OS X build with a backout.
Olli: I confirmed that this perf regression was caused by cset e176126d2b42, your fix for AfterProcessNextEvent hang bug 953435.
Blocks: 953435
Flags: needinfo?(bugs)
Summary: http://paperjs.org/examples/voronoi/ has regressed badly → http://paperjs.org/examples/voronoi/ drawing performance has regressed badly
Bug 953435? Really? That should have affected Mac only. This was filed on Windows.
Flags: needinfo?(bugs)
But I can't reproduce this on Windows. Aurora and Nightly builds feel just the same.
I've only tested Mac and, while the performance difference is subtle, it is apparent if you move the mouse pointer quickly around the red dots. The moving red dot is laggy.

I bisected local opt builds in the pushlog range from comment 6 (though I had to apply the patches individually because the build is broken during that range of changesets).
Component: JavaScript Engine → General
Chris, please provide *detailed* information on how you reproduce this bug.  Also please check to see that you still see it with a clean profile.
STR:
1. Create a new profile with Nightly 29.
2. Load http://paperjs.org/examples/voronoi/
3. See the big red shapes.
4. Move your mouse pointer horizontally along the white space between the red shapes.
5. See that the moving red shape is a little laggy and jumps to catch up with the mouse pointer.
6. Now create a new profile with Aurora 28.
7. Perform steps 2–4.
8. See that the moving red shape moves much more smoothly and keeps pace with the mouse pointer.

I am running OS X 10.9.1 on a "Late 2013" MacBook Pro Retina.
Thanks Chris.

I do notice a big difference in performance between these two nightlies (testing on OS X 10.9.1 on a Retina MacBook Pro, as you did):

firefox-2014-01-09-03-02-03-mozilla-central
firefox-2014-01-10-03-06-50-mozilla-central

I tested with the page (http://paperjs.org/examples/voronoi/) maximized, on the assumption that this would increase the likelihood of seeing the bug.

I'll also test on other versions of OS X.  And ultimately I'll redo your hg bisect -- to see if I get the same result.

By the way, the voronoi site must have a poor network connection (or maybe the server it's running on is dodgy) -- when I first tried to visit it, I got a "site not found" error from Nightly.
It's easier to reproduce the lag if there is a tab loading in the background, even if the tab is something very simple like google.com.
> firefox-2014-01-09-03-02-03-mozilla-central
> firefox-2014-01-10-03-06-50-mozilla-central

Actually I see a narrower regression range:

firefox-2014-01-09-03-02-03-mozilla-central
firefox-2014-01-09-09-49-02-mozilla-central

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=711f08b0aa1d&tochange=effa95701c83

This range still includes the patch for bug 953435, though.

I'll post the results of my hg bisect once I have it.
> firefox-2014-01-09-03-02-03-mozilla-central
> firefox-2014-01-09-09-49-02-mozilla-central
>
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=711f08b0aa1d&tochange=effa95701c83

I've confirmed this same regression range on OS X 10.8.5 (on a non-Retina Mac Pro) and 10.7.5 (on the same Retina MacBook Pro on which I tested in OS X 10.9.1).

For some reason the regression is much worse (more noticeable) on my MacBook Pro.  So that's where I'll hg bisect.
Reporter, you're seeing this on Windows?
I've confirmed Chris's result.  I find that Smaug's patch for bug 953435 triggers this bug:
http://hg.mozilla.org/mozilla-central/rev/e176126d2b42

I used hg bisect, and then confirmed the result "by hand" - I found that this bug happens with a build made with the tree as it was as of the http://hg.mozilla.org/mozilla-central/rev/e176126d2b42 revision, but not as it was as of the revision just prior (http://hg.mozilla.org/mozilla-central/rev/010f5faed9cf).

This is terribly puzzling.  Tomorrow I'll do some profiling to see what I can find.

Reporter, I suspect you're actually seeing this on OS X, but reported the bug on Windows.  Otherwise what you're seeing is completely unrelated to what Chris and I have seen (since Smaug's patch is Mac-only).

Smaug, do you have a Mac you can test on?
People who want to check out this bug should use opt builds.  Performance at http://paperjs.org/examples/voronoi/ is much worse with non-opt builds, to the point that it isn't easy to see the difference made by Smaug's patch.
I don't have a mac.
Attached file about support.txt
(In reply to Olli Pettay [:smaug] from comment #24)
> Reporter, you're seeing this on Windows?

Yes, I am seeing this on a Win7 machine.  Sorry for not attaching about:support earlier.

(In reply to Steven Michaud from comment #25)

> 
> Reporter, I suspect you're actually seeing this on OS X, but reported the
> bug on Windows.  Otherwise what you're seeing is completely unrelated to
> what Chris and I have seen (since Smaug's patch is Mac-only).

No, it really is a Win7 machine :)
I don't have a Windows 7 machine to test on, but I do have a Windows 7 VM that I can run (on a Mac) in VMWare.  With it I don't notice any performance difference between Firefox 26 and today's mozilla-central nightly, using Chris' STR from comment #19.

Chris, do you have an actual Windows 7 machine (not a VM) you can test on?  If you do, I'd be curious to see what you think.

Smaug, you said above (comment #16) that you don't see this bug on Windows.  But were you running Windows 7?
Reporter, I know you don't have time to do a full regression range search.  But do please try this (which won't take much time at all).

Download the following two mozilla-central nightlies, and see if you notice any performance difference between them:

firefox-2014-01-09-03-02-03-mozilla-central
firefox-2014-01-09-09-49-02-mozilla-central

Firefox nightlies (including these two) can be found at http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/.
firefox-2014-01-09-09-49-02-mozilla-central appears to be Mac only. I can't run those.
Oops, sorry.  Try the following instead:

firefox-2014-01-09-03-02-03-mozilla-central
firefox-2014-01-10-03-06-50-mozilla-central
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-01-10-03-06-50-mozilla-central/firefox-29.0a1.en-US.win32.installer.exe

appears to be smoother.
But with a caveat: comment 21 still applies to both builds. So when a background tab is *reloading*, the animation grinds to almost a complete halt.  But is that a different bug?
My guess is that what you're seeing is entirely unrelated to what Chris and I are seeing.

Since what Chris and I are seeing is *much* more easily reproducible, I'm going to concentrate on that.
(In reply to Steven Michaud from comment #30) 
> Smaug, you said above (comment #16) that you don't see this bug on Windows. 
> But were you running Windows 7?
Yes, Windows 7
(In reply to Steven Michaud from comment #35)
> My guess is that what you're seeing is entirely unrelated to what Chris and
> I are seeing.
> 
> Since what Chris and I are seeing is *much* more easily reproducible, I'm
> going to concentrate on that.

Should I file a new bug about it?
> should i file a new bug about it ?

It's up to you, but I think probably not.

In all your comments, you've never said in any detail what you're actually seeing.  So no-one else here has been able to reproduce what you've reported.  (Chris at first thought he had, but that turns out not to be true.)  Unless you can provide that kind of detail, there's probably no point in opening another bug.
I don't have a Windows machine to test.
firefox-2014-01-09-03-02-03-mozilla-central
firefox-2014-01-09-09-49-02-mozilla-central

I used 'top' to see what performance differences I could detect between these two nightlies on this bug's testcase (at http://paperjs.org/examples/voronoi/).  I found that the *bad* nightly (the one with poorer performance) actually uses *less* CPU than the good one.  This suggests the problem is caused by event starvation.  I'll keep digging.

Here's how I tested:

1) ssh into my testbed and run 'top'.
2) Run Firefox, maximize its browser window and visit http://paperjs.org/examples/voronoi/.
3) Rapidly move (not drag) the mouse around over the red shapes to the right.
Assignee: nobody → smichaud
OS: All → Mac OS X
Whiteboard: [STR in comment #19]
I can think of two basic approaches to fixing this bug:

1) Rewrite Smaug's patch for bug 953435.
2) Get to the bottom of the event-starvation business.

At least for the next few days I'm going to pursue approach #2.  It's much riskier, but frankly it's much more interesting, and also much likelier to produce long term good results.  I've been meaning for years to go over the code at nsAppShell::ProcessNextNativeEvent(), but have never found the time.  This bug gives me the excuse to at least begin the task.  I know a lot more about Mac reverse-engineering than when I wrote that code, so hopefully I'll be able to unlock a few mysteries.

Approach #1, if it's possible, is certainly safer and probably easier.  I'll fall back to it if I have to.  But in the meantime, can I ask you to work on it, Smaug?  I know you don't have a Mac, but I can test out your ideas (if you can't find anyone else to do so).
What would #1 mean? In which way rewrite? The patch was super simple, so it is quite surprising to see it causing this kind of regression -- though, after hacking the appshell to get rid of perf mode, I can see how touching anything close to the main loop is super regression-risky.
> In which way rewrite?

I don't know.  I was hoping you'd have some ideas, since you figured out the original bug :-)

If you don't, then just leave it.  Even if I come up with something using approach #2, it'll have to bake for a *long* time before getting into a release.  So I'll probably still also have to pursue approach #1 -- but not until I've spent some time working on approach #1.
> but not until I've spent some time working on approach #1.

but not until I've spent some time working on approach #2.
I'll let you know immediately if some good idea comes to my mind :)
A profile of this testcase shows that the mousemove events are processed with two alternating call stacks: -[GeckoNSApplication sendEvent:] is either called directly from -[NSApplication run], or it's called from the nested native event peeking in nsAppShell::ProcessNextNativeEvent:

Common top part:

-[NSWindow sendEvent:]
-[BaseWindow sendEvent:]
-[ToolbarWindow sendEvent:]
-[NSApplication sendEvent:]
-[GeckoNSApplication sendEvent:]

This part is only present for every other event:

nsAppShell::ProcessNextNativeEvent(bool)
nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool, unsigned int)
nsAppShell::OnProcessNextEvent(nsIThreadInternal*, bool, unsigned int)
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessPendingEvents(nsIThread*, unsigned int)
nsBaseAppShell::NativeEventCallback()
nsAppShell::ProcessGeckoEvents(void*)
Events::ProcessGeckoEvents
__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
__CFRunLoopDoSources0
__CFRunLoopRun
CFRunLoopRunSpecific
RunCurrentEventLoopInMode
ReceiveNextEventCommon
BlockUntilNextEventMatchingListInMode
_DPSNextEvent
-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]

Common bottom part:

-[NSApplication run]
nsAppShell::Run()
nsAppStartup::Run()
XREMain::XRE_main(int, char**, nsXREAppData const*)
Startup::XRE_Main
XRE_main
main

---

Is it possible that the refresh driver timer event (a Gecko event) isn't handled directly after the mouse move event because, when ping-ponging between these two event processing states, Gecko events don't get their chance to be handled as long as a native event is pending?
> Is it possible that the refresh driver timer event (a Gecko event)
> isn't handled directly after the mouse move event because, when
> ping-ponging between these two event processing states, Gecko events
> don't get their chance to be handled as long as a native event is
> pending?

I'll look into this.  I'll also try to find a way to stop the
ping-ponging.

But my first order of business is to figure out how AppKit gets its
events from the window server.  Two undocumented methods are always
used -- _DPSNextEvent() (a non-exported method in AppKit) and
BlockUntilNextEventMatchingListInMode() (a public method in HIToolbox,
called from the former).  I'm hoping that using the latter method
directly will allow us more control -- enough to use it instead of
always calling -[NSApplication
nextEventMatchingMask:untilDate:inMode:dequeue:], which I discovered
long ago won't do the job.

(Side note:  I'm pretty sure the name _DPSNextEvent is a living
fossil, like a coelacanth.  I suspect "DPS" stands for "display
PostScript", which would take it back to the earliest NeXT days.  The
AppKit framework also has some other symbols that contain "DPS".)
To add some fun to the mix, bug 930793 landed on Jan-16 which changed event handling (especially the relation between gecko/native events), and today bug 962911 landed and got backed out (should bring event processing closer to how it was before bug 930793 landed).
We are going to track this, as we see the potential for this being an issue if not watched and addressed.
I've spent the last couple of days digging around in the assembly code of _DPSNextEvent() and BlockUntilNextEventMatchingListInMode(), plus some of the methods they call.  Along the way I discovered that most of this code spins the event loop (via CFRunLoopRunInMode()), sometimes multiple times.  But there's also a way to get events from the window server *without* spinning the event loop -- AcquireFirstMatchingEventInQueue().  This is actually documented!  Though it's a Carbon Event Manager method, it's also used (indirectly) by the two methods mentioned above.

This patch is an experiment (a pretty radical one).  It processes events without spinning the event loop when nsAppShell::ProcessNextNativeEvent() is called with aMayWait false.  And even then it only processes *some* of them -- mostly the random undocumented junk that keeps the OS happy.  It leaves the others (for example hardware events like keyboard and mouse events) in the queue to be processed by subsequent calls to nsAppShell::ProcessNextNativeEvent() with aMayWait true.  When that happens I assume it's safe to spin the event loop (and also to wait indefinitely for an event), so I use -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] as we have before.

This *isn't* the final patch.  I've already found one fairly serious problem (though only on OS X 10.6), and I'm sure there are others.  But it works surprisingly well in my limited testing, especially for being such a big change.  For example it fixes both this bug and the reproducible part of bug 962942 (see bug 962942 comment #9).

Please try it out.  In fact please use it as your main browser for several days.  Bang away at it as hard as you can, and please report any problems here (after checking that they don't already happen in current trunk code).

I've started a tryserver build here:
https://tbpl.mozilla.org/?tree=Try&rev=13235562d9e6
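
For reference, the core of the approach looks roughly like this -- a sketch only, not the attached patch itself. The Carbon/Cocoa calls are real, but the "housekeeping" filter and the function name are made up for illustration:

    // Sketch only -- not the attached patch.
    #import <Cocoa/Cocoa.h>
    #include <Carbon/Carbon.h>

    // Returns true if a native event was processed.
    static bool ProcessNextNativeEventSketch(bool aMayWait)
    {
      if (!aMayWait) {
        // AcquireFirstMatchingEventInQueue() just looks at the native event
        // queue; unlike -[NSApplication nextEventMatchingMask:...] it doesn't
        // spin the event loop.
        EventRef event = AcquireFirstMatchingEventInQueue(
          GetCurrentEventQueue(), 0, NULL, kEventQueueOptionsNone);
        if (!event) {
          return false;
        }
        bool handled = false;
        // Only handle "housekeeping" events here (Apple events, in this
        // sketch).  Hardware events (keyboard, mouse) stay in the queue for a
        // later call made with aMayWait == true.
        if (GetEventClass(event) == kEventClassAppleEvent) {
          RemoveEventFromQueue(GetCurrentEventQueue(), event);
          SendEventToEventTarget(event, GetEventDispatcherTarget());
          handled = true;
        }
        ReleaseEvent(event);
        return handled;
      }

      // When we may wait it's assumed safe to spin the event loop and block,
      // so keep using the same call as before.
      NSEvent* cocoaEvent =
        [NSApp nextEventMatchingMask:NSAnyEventMask
                           untilDate:[NSDate distantFuture]
                              inMode:NSDefaultRunLoopMode
                             dequeue:YES];
      if (cocoaEvent) {
        [NSApp sendEvent:cocoaEvent];
      }
      return cocoaEvent != nil;
    }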
See Also: → 962942
Do you have any idea how bug 953435 caused this one? Should we somehow tweak InGeckoMainEventLoop()?
> Do you have any idea how 953435 caused this one?

I think Markus (comment #46) is probably right that the patch for bug 953435 increased what he called the "ping-ponging", and caused/triggered this bug (and other related bugs).

> Should we somehow tweak InGeckoMainEventLoop()?

With luck we can get rid of it :-)
Here's the tryserver build made with my patch from comment #50:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-13235562d9e6/try-macosx64/firefox-29.0a1.en-US.mac.dmg

The tests haven't all finished, but so far things are looking good.
I've been using the build from comment 53 for 2 days. So far I haven't noticed bug 962942 even once. I don't resize much manually, but I also didn't notice this bug with this build. So far so good.
Still using the build from comment 53. I did see an issue once (possibly twice) where the main window lost focus and clicking it anywhere (including the title) couldn't bring the focus back. I.e. the window title stayed light gray, clicking links didn't work, etc. (the mouse cursor changed only over some elements, and clicking worked only for tabs and menus, like bookmarks on the toolbar, but nowhere else).

Only way to "fix" it was to command-tab to another application and then get back to nightly.

I don't think I've noticed these issues with normal nightlies.
Trying to bring back the old behavior where InGeckoMainEventLoop==true
means that we're either not in nsThread::ProcessNextEvent at all, or not
in a nested nsThread::ProcessNextEvent call.
Note that before bug 953435, recursionDepth==0 meant exactly that; but when
using GetRecursionDepth we get recursionDepth+1, since nsThread::mRunningEvent
is incremented after the thread observers are called, and decremented before
calling them again.

Hopefully it compiles
https://tbpl.mozilla.org/?tree=Try&rev=f5a4245d88a1
And if it does, someone should test it.


(I think we need something like this for FF29.)
(In reply to comment #55)

> I did see an issue once (possibly twice) where the main window lost
> focus and clicking it anywhere (including the title) couldn't bring
> the focus back.

I've seen this too, though only with an earlier version of my patch.
However the fact that you've seen it with my latest patch shows that I
haven't yet fixed the problem.

I still consider the results of my "radical simplification" very
promising, and will pursue it for the rest of this week.  I'll post
new patches and tryserver builds along the way.
(In reply to comment #56)

Smaug, I'll try your patch, though I doubt it will fix the bug.

If I can get my "radical simplification" working smoothly, we should land that on the trunk to give it more exposure.  And if no problems arise that can't be fixed, we should let it ride the train -- though it's probably best not to accelerate its progress.

However, even if my "radical simplification" does ride the train, the fact that it's a big change increases the risk of subtle bad side effects that may take a while to appear.  So it does make sense to have a band-aid fall back ... if it's possible to write one.  I'll switch to that once I have a "radical simplification" patch that I'm satisfied with.
Attachment #8365988 - Flags: feedback?(smichaud)
Comment on attachment 8365988 [details] [diff] [review]
Try to fix InGeckoMainEventLoop

> Smaug, I'll try your patch, though I doubt it will fix the bug.

Indeed it doesn't.
Attachment #8365988 - Flags: feedback?(smichaud) → feedback-
Here's a revision of my "radical experiment".  It cleans things up a bit, corrects one mistake, and comments out more code that my patch will (I hope) render obsolete.

I've started a tryserver run here:
https://tbpl.mozilla.org/?tree=Try&rev=6974e79cb2d8

I'll post a link to the tryserver build when it becomes available.
Attachment #8365428 - Attachment is obsolete: true
(Following up comment #50)

> I've already found one fairly serious problem (though only on OS X 10.6)

Turns out this also happens in current trunk code, so it has nothing to do with my patch.  (I'll open a bug on it in the next few days.)
(In reply to comment #46)

I've been trying to replicate your analysis, Markus.  But I haven't been able to figure out how you did it.  Please let us know in detail.

Please also compare the results you get with (say) today's m-c nightly and what you get with rev1 of my "radical experiment" patch.
Flags: needinfo?(mstange)
(Following up comment #61)

I opened bug 964541.
I think we need a band aid for FF29, and land the scarier patch early in the next cycle.

(Still wondering why my patch didn't help. Something like that should work -- basically limiting the change from bug 953435 to the time between AfterProcessNextEvent and --mProcessingEvent. ...But I see: nsBaseAppShell::OnProcessNextEvent is the problem. We need a flag for when we're executing that method.)
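
Something like this scoped guard could set that flag (the names here are invented for illustration -- this isn't from any attached patch):

    // Illustrative only: a scoped guard that records "we are inside
    // nsBaseAppShell::OnProcessNextEvent", so that code like
    // InGeckoMainEventLoop() can special-case that window.  A counter rather
    // than a bool, so nested calls behave correctly.
    #include <stdint.h>

    class AutoInOnProcessNextEvent {
    public:
      explicit AutoInOnProcessNextEvent(uint32_t& aDepth) : mDepth(aDepth) {
        ++mDepth;
      }
      ~AutoInOnProcessNextEvent() {
        --mDepth;
      }
    private:
      uint32_t& mDepth;
    };

    // Hypothetical use at the top of nsBaseAppShell::OnProcessNextEvent():
    //   AutoInOnProcessNextEvent guard(mInOnProcessNextEventDepth);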
Attached file testcase (obsolete)
paperjs.org is down for me, so here's a simple testcase.
(In reply to Steven Michaud from comment #62)
> (In reply to comment #46)
> 
> I've been trying to replicate your analysis, Markus.  But I haven't been
> able to figure out how you did it.  Please let us know in detail.

Here are new profiles with the testcase I uploaded:

(1) Without patch:
http://people.mozilla.org/~bgirard/cleopatra/#report=bfd898197ea3c494be2e4a00173c365543e9d97a

(2) With smaug's patch (attachment 8365988 [details] [diff] [review]):
http://people.mozilla.org/~bgirard/cleopatra/#report=871ee5c6ad7a05958b167a94972c1b8652587eac

(3) With your latest patch (attachment 8366280 [details] [diff] [review]):
http://people.mozilla.org/~bgirard/cleopatra/#report=7b8a9c4eb29e1b8d91ec952682690722eff062de

Profiles (1) and (2) basically look the same: Whenever we get into nsThread::ProcessNextEvent, instead of processing the timer event, we directly call nsAppShell::OnProcessNextEvent (which, under nsAppShell::ProcessNextNativeEvent, processes the next native mouse move event) and then do not process any Gecko events, even though nsThread::ProcessNextEvent is the place that should do that.

Profile (3) looks much better: nsThread::ProcessNextEvent always processes Gecko events, and when it calls nsAppShell::OnProcessNextEvent (and through that, nsAppShell::ProcessNextNativeEvent), no native event gets processed.

The last part confuses me a bit, but there you are. I haven't actually looked at your patch yet.
Flags: needinfo?(mstange)
(In reply to comment #66)

Sorry to bug you again, Markus, but I also did my own profiles.  What I don't know is how you got your information *out* of your profiles.

(In reply to comment #65)

> paperjs.org is down for me, so here's a simple testcase.

Your testcase is a bit more subtle (at least for me) than the paperjs.org voronoi testcase.  I mirrored the site a few days ago.  If it stays down I'll put up the mirror somewhere.
Flags: needinfo?(mstange)
(In reply to comment #64)

> I think we need a band aid for FF29, and land the scarier patch
> early in the next cycle.

Sounds reasonable to me.  But (given the complexities of the case) a
workable bandaid may not be possible.  In that case, how do you feel
about backing out the patch for bug 953435 in FF 29?
I think a bandaid should be doable... just writing a patch.
(Not having a Mac makes this just a bit slower.)
Attached file better testcase
I think I misdiagnosed the role of nsThread::ProcessNextEvent: It does process a Gecko event, just not the refresh driver event. It looks like there's another Gecko event in the queue before it most of the time; I haven't found what it is yet.

This testcase should be a bit more reliable because it posts its own Gecko event before it schedules the refresh driver timer.
Attachment #8366629 - Attachment is obsolete: true
(In reply to comment #69)

Whatever bandaid patch you (or I) come up with will most likely be the result of semi-random guessing, and trial and error.  We may be lucky.  But I think it's much more likely that we won't be.
The bandaid patch should bring back the old behavior, but not hang in case of mutation observer callbacks.
Of course.  But my point is that, in this case, that's much easier said than done.
Attached patch: like this? (Splinter Review)
Untested and all, and try is closed so I couldn't even push there.
(In reply to Steven Michaud from comment #67)
> Sorry to bug you again, Markus, but I also did my own profiles.  What I
> don't know is how you got your information *out* of your profiles.

Right. It's a bit hard to describe, but I'm going to try.

The first thing I look for in a profile is whether there are any patterns in the main thread histogram. The main thread histogram is displayed at the top, next to "GeckoMain". In profile (1), it looks similar to a piece-wise constant function: constantly lower parts alternate with constantly higher parts. The heights of the bars in the histogram represent stack depths. The horizontal axis is time. If you click on a bar in the histogram, you are shown the call stack of that particular sample: The call tree in the center of the profiler selects the leaf function of the selected sample, and the list in the grey box at the right shows the complete call stack.

You can zoom in to parts of the profile by drag-selecting a part in the histogram and then clicking on the "Sample range" button that appears in the bar at the top. You can zoom out by clicking the "Complete Profile" button.

The profiles taken from my testcase contain bits that are a little confusing: There are some very short call stacks that you can see if you zoom in a bit; I'm referring to those that begin their stack with (root) > Startup::XRE_main. Those samples can be ignored; they're there because Date.now() (which the sleep function in my testcase calls) calls mach_absolute_time and the profiler can't unwind that function properly, so we get a messed-up callstack for those samples. Please ignore them.

Now let's look at the rest of the call tree. In it, you can find the function -[GeckoNSApplication sendEvent:] in two places, under the call stacks that I pasted in comment 46. If you select either of them, about half of the histogram lights up in green. Green means that those samples happened under the selected call stack. By drag-selecting in the histogram you can see the durations of the calls; each sendEvent call takes about 100ms. That's because the testcase calls sleep(100) for each mouse move event.

Above the main thread histogram, there's a bar labeled "Frames". The markers in that part show when we paint. If you zoom in very closely around the first of these markers, for example to Sample Range [6937, 6946], you can click on the sample under the "Layout" marker and see that this is where the refresh driver repaints the page: The callstack of that sample contains nsRefreshDriver::Tick. Going towards the root of the callstack, you come across nsTimerEvent::Run, which is called from nsThread::ProcessNextEvent. So now we know that nsTimerEvent::Run is the thing that happens too rarely, and which we'd ideally like to see after each mouse move event.

If you collapse the direct child nodes of nsThread::ProcessNextEvent and select each of them, you can see that most of the time nsThread::ProcessNextEvent only calls nsAppShell::OnProcessNextEvent and not nsTimerEvent::Run.

I think that covers most of it. Do you have any other questions?
Thanks Markus!

This is exactly what I wanted, and it helps a lot.  With these new techniques (new to me, that is), I hope to be able to build a good patch and a solid conceptual foundation for it.
(In reply to Olli Pettay [:smaug] from comment #74)
> Created attachment 8366783 [details] [diff] [review]
> like this?
> 
> Untested an all, and try is closed so couldn't even push there.

This patch doesn't fix it for me.
Flags: needinfo?(mstange)
Comment on attachment 8366783 [details] [diff] [review]
like this?

> like this?

Nope.  It seems to make no difference at all.
Why... what am I missing in the patch? I wish I could actually test this stuff :/
It really shouldn't be too complicated to add a special case for when we spin the event loop
after nsAppShell::AfterProcessNextEvent but before nsThread::ProcessNextEvent returns.
You are testing m-c/m-i from the time when bug 953435 landed, and not anything more recent?
Since there have been other changes to the event loop after that.
> It really shouldn't be too complicated

But it *is* in this case, as I've been saying.

You really need a Mac to do what you're trying to do -- which is random guessing and trial and error.  I will get to it myself eventually -- probably later this week.
Smaug, I've been testing your patches in current trunk code.  But believe me, the results wouldn't be any different if I tested in the tree as of when your patch for bug 953435 landed.

Part of the problem is that *none* of us has a really good understanding of why this bug happens -- though both Markus and I have been trying to build one up.  And the code that I'm trying to replace was written before I really had the tools to acquire a good understanding of what I was doing (a good understanding of what Apple does under the hood, and of the implications of (in effect) having two event loops -- a Gecko one and a native one).  So even the *original* work was done using a lot of guessing and trial and error.  *Very careful* guessing and trial and error, whose results have stood the test of time well.  But because of this the "original code" is fragile -- even small changes can have unexpected side effects.
I'm slowly making progress figuring out this bug, but I've still got some way to go.

Markus, I think you're on the wrong track with what you say in comment #75:  Testing with this bug's original testcase (the voronoi example), I don't notice large gaps or irregularities in the timing of calls to nsRefreshTimer::Tick(), or of our paints in the Frames bar (comparing today's m-c nightly with either my rev1 patch or the firefox-2014-01-09-03-02-02-mozilla-central nightly, which is the last before the patch for bug 953435 landed).

But I *do* notice gaps in the timing of calls to onMouseMove() in the voronoi example (the calls can stop happening for several seconds) with today's m-c nightly that I don't see with either of the other two builds.  I'm going to find out as much as I can about what's causing these gaps.
(Following up comment #84)

I should also have mentioned that I don't see gaps or irregularities (with any build) in the timing of calls to -[ChildView handleMouseMoved].
Since we don't have a fix, I think I'll back out bug 953435 today, to get that in before the next merge.
> But I *do* notice gaps in the timing of calls to onMouseMove() in the voronoi example

This was an illusion.  Turns out onMouseMove() can be called on many paths, especially with the patch for bug 953435.  I wasn't looking at all the paths.  (I discovered the best way to do this in Benoit Girard's profiler is to filter on "onMouseMove()".)

After several more hours of testing, I now *think* I see gaps in the timing of calls to Path._changed() (in the voronoi example) with the patch for bug 953435 that aren't present without it.  These "gaps" last about 200ms, and though they're also present without the patch, there are a lot more of them with the patch.

I'll keep trying at least a bit longer to find a step-by-step explanation of why this bug happens.  But ultimately I may have to give up and fall back to a general explanation of why my "radical simplification" *should* work.
> I now *think* I see gaps in the timing of calls to Path._changed()

This, too, was an illusion.

> I don't notice large gaps or irregularities in the timing of calls to nsRefreshTimer::Tick()

But looking at them again, I do notice that there are a lot fewer of them per unit of time with the patch for bug 953435 than without it.  And this in turn is caused by the fact that the delay between nsTimerImpl::PostTimerEvent() and nsTimerEvent::Run() is a lot longer with the patch for bug 953435 -- on average and especially in the worst cases.

Next week I'll try to figure out why.
Steven, have you been able to figure out why?
29 beta is getting closer! :)
Flags: needinfo?(smichaud)
I took last week off from this bug to deal with other bugs.

Which was a good thing, since I now have some new ideas I'll be trying out over the next few days.

My hope is still to come up with both a bandaid patch and "radical redesign" patch.  As we've seen, finding a bandaid patch is much more difficult, but it should be within my grasp if I can get a better understanding of why this bug happens.

If I can't find a bandaid patch by the end of this week, I'll concentrate on my "radical redesign" patch.  That should land early in the development cycle, and it may now be too late to get it onto the 30 branch.  If so, I'll plan to land it on the 31 branch.

In any case, the patch for bug 953435 should *not* reland until we get at least a bandaid patch for this bug.  Bug 953435 doesn't appear to be a topcrasher, so fixing it isn't urgent.
Flags: needinfo?(smichaud)
OK. So I will consider your message as a wontfix for 29, then.
Thanks!
See Also: → 977856
Blocks: 977856
This code is bailing out like crazy.  It seems to be bailing out at the same point in the same JS function, but the function's IonCode is not being discarded and the function forced to run in baseline-only mode.

Here's where the bailout is happening:

Bailout_Normal at pop on line 12174 of http://people.mozilla.org/~stmichaud/paperjs.org/assets/js/paper.js:12172

Here is the stack at that point:
    js::Shape::replaceLastProperty(js::ExclusiveContext*, js::StackBaseShape&, js::TaggedProto, JS::Handle<js::Shape*>)
    SegmentPoint() @ paper.js:5027
    Segment() @ paper.js:4755
    js::RunScript
    paper</<.statics.read() @ paper.js:319
    paper</Path<.add() @ paper.js:6505
    createPath() @ paper.js line 12313 > eval:76
    renderDiagram() @ paper.js line 12313 > eval:25
    onMouseMove() @ paper.js line 12313 > eval:17
    js::RunScript
    callHandlers() @ paper.js:614
    paper</Callback.fire() @ paper.js:605
    paper</Tool<._fireEvent() @ paper.js:10883
    paper</Tool<._handleEvent() @ paper.js:10906
    handleMouseMove() @ paper.js:10040
    mousemove() @ paper.js:10051
    js::RunScript
    nsEventDispatcher::Dispatch
    event::nsViewManager::DispatchEvent
    Startup::XRE_Main
    (root)
As I asked at bug 982317 comment #3, what exactly is a "bailout event"?  Why does it happen?
I strongly suspect the high number of bailout events is unrelated to this bug.  For a more detailed discussion see bug 982317.
Blocks: 982370
I'm working on this again, after taking another break.  And I've had another breakthrough -- though I want to consolidate it a bit more before saying much about it.

My breakthrough is finding out that the reduction in frequency of calls to RefreshTimer::Tick() (mentioned in comment #88, which is the proximate cause of this bug) is associated with a (relatively) large number of calls to [NSApp sendEvent:] (in nsAppShell::ProcessNextNativeEvent()) while aMayWait is false.

Up til now I've spent all my time on the mechanics of getting events.  But now I've found I need to also pay attention to the mechanics of sending events, and that it can interfere with getting events.  I'm not yet sure whose "fault" this is -- Apple's or ours, or some combination.  But I hope to find out more next week.

I've also found a half-decent bandaid patch.  Though I'll wait to say anything about it until I know more why it works.
Comment #96 is incorrect -- I was testing without optimization.  Sorry for the confusion :-(
I think I finally understand what causes this bug.  Here's a first
stab at describing it from soup to nuts.  I'll break my description up
into several comments -- I hope this will make it easier to read.

As I mentioned above in comment #88, the proximate cause of this bug
is that there are a lot fewer calls to nsRefreshDriver::Tick() per
unit of time.  This in turn is caused by the fact that the delay
between nsTimerImpl::PostTimerEvent() and nsTimerEvent::Run() is a lot
longer with this bug -- on average and especially in the worst cases.
In other words nsTimerEvents are "starved" -- they're not processed
quickly enough.

The nsRefreshDriver class uses a RefreshDriverTimer class to time its
"ticks", and an nsTimerEvent object (an nsIRunnable object) to "fire"
each "tick".  All nsIRunnable events are handled in
nsThread::ProcessNextEvent() -- including nsTimerEvent events.

But nsThread::ProcessNextEvent() also calls OnProcessNextEvent() on
all registered observers, including the nsAppShell class.  And
nsBaseAppShell::OnProcessNextEvent() (indirectly) calls
nsAppShell::ProcessNextNativeEvent(), potentially multiple times.  And
the behavior of nsAppShell::ProcessNextNativeEvent() was altered by
the patch for bug 953435 (which triggered this bug, and has since been
backed out).
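
Here's a tiny standalone model of that interaction (toy numbers, not
measurements, and not Gecko code) just to show why extra work in the
observer hook translates directly into fewer ticks per unit of time:

    // Toy model only.  "Time" is an arbitrary budget and the per-step costs
    // are made up.  The point is the shape of the loop: the observer hook
    // runs before each Gecko event, so making it expensive starves the ticks.
    #include <cstdio>
    #include <initializer_list>

    struct LoopModel {
      int timeBudget = 1000;               // arbitrary units of main-thread time
      int ticks = 0;                       // completed "refresh driver ticks"
      bool expensiveObserverHook = false;  // ~= with the patch for bug 953435

      void ProcessNextEvent() {
        // 1) Observer hook: nsAppShell::OnProcessNextEvent() ->
        //    nsAppShell::ProcessNextNativeEvent().  With the patch this ends
        //    up sending native events into Gecko, which is where the CPU goes.
        timeBudget -= expensiveObserverHook ? 8 : 1;
        // 2) One Gecko event runs -- here, always the tick's nsTimerEvent.
        timeBudget -= 2;
        ++ticks;
      }
    };

    int main() {
      for (bool expensive : {false, true}) {
        LoopModel loop;
        loop.expensiveObserverHook = expensive;
        while (loop.timeBudget > 0) {
          loop.ProcessNextEvent();
        }
        std::printf("%s the patch: %d ticks\n",
                    expensive ? "with" : "without", loop.ticks);
      }
      return 0;
    }

With these made-up costs the "expensive hook" run completes roughly a
third as many ticks; the real numbers differ, but the shape is the same.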

Install the Gecko Profiler
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/Existing_Tools
and run the following test in the
firefox-2014-01-09-03-02-03-mozilla-central nightly (which doesn't
contain the patch for bug 953435) and the
firefox-2014-01-09-09-49-02-mozilla-central nightly (which does):

Load this bug's testcase
(http://people.mozilla.org/~stmichaud/paperjs.org/examples/voronoi/)
and move (not drag) the mouse around over the red "blobs" to the right
for about 30 seconds.  Then click on the Gecko Profiler and choose
"Analyze".

If you focus the callstack on "nsThread::ProcessNextEvent", you'll
notice a striking difference between the performance of the two
nightlies:  In the one with the patch for bug 953435 (and with this
bug), calls to nsAppShell::OnProcessNextEvent() eat more CPU than
anything else (called from nsThread::ProcessNextEvent()), and four or
five times as much CPU as calls to nsTimerEvent::Run().  In the one
without the patch for bug 953435 (and without this bug) it's the calls
to nsTimerEvent::Run() that eat more CPU than anything else, and about
three times more CPU than the calls to
nsAppShell::OnProcessNextEvent().

What you're seeing here is calls to nsTimerEvent::Run() being starved
when this bug happens.

So then what makes nsAppShell::OnProcessNextEvent() eat so much more
CPU and cause this bug?

If you look at what's called from nsAppShell::OnProcessNextEvent()
(itself called from nsThread::ProcessNextEvent()) without this bug,
you'll see few or no calls to -[GeckoNSApplication sendEvent:].  But
with this bug you'll see *tons* of them, and you'll find that they're
responsible for almost all the CPU eaten by
nsAppShell::OnProcessNextEvent().  If you look further down the stack,
you'll see that almost all the CPU is eaten by calls to
nsEventDispatcher::Dispatch().

To sum up the tale so far, this bug is caused by tons of calls
(indirectly) to nsEventDispatcher::Dispatch() being made from
nsAppShell::OnProcessNextEvent() that are normally made from
elsewhere, or not at all.

I'll continue my explanation in my next comment.
(Following up comment #98)

Turns out I'm still missing some steps for the next part of my explanation.  So I'll say "continued next week" :-)
(Continuing from comment #98)

Without this bug, ~98% of the CPU consumed (indirectly) by calls to
-[GeckoNSApplication sendEvent:] is consumed in calls made from
nsAppShell::Run(), and ~2% is consumed in calls made from
nsThread::ProcessNextEvent() (via
nsAppShell::ProcessNextNativeEvent(), mostly at the lowest level of
recursion).  With this bug only ~58% is consumed in calls made from
nsAppShell::Run(), and altogether ~42% is consumed in calls made from
nsThread::ProcessNextEvent() (at several different levels of
recursion).

So clearly we want most calls to -[GeckoNSApplication sendEvent:] to
be made from nsAppShell::Run().  But how do we accomplish this?

nsAppShell::ProcessNextNativeEvent() (and below it
nsBaseAppShell::OnProcessNextEvent()) is called *much* more often with
aMayWait == false than it is with aMayWait == true (almost 100 times
more often without this bug).  So calling -[GeckoNSApplication
sendEvent:] only when aMayWait == true reduces the total number of
calls made from nsAppShell::ProcessNextNativeEvent() (and increases
the proportion made from nsAppShell::Run()).  Also the whole point of
aMayWait is that we should be allowed to eat more CPU when it's true.

So one obvious way to fix this bug, or at least to work around it, is
to only call -[GeckoNSApplication sendEvent:] from
nsAppShell::ProcessNextNativeEvent() when aMayWait is true, or at
least to minimize the number of calls made when aMayWait is false.
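
In code terms, the guard would look something like this (a sketch of the
idea only, not the bandaid patch attached below; IsDeferrableEvent() is a
made-up helper):

    // Sketch only.  IsDeferrableEvent() is hypothetical -- it stands in for
    // whatever test decides which native events can wait for a call made
    // with aMayWait == true (or for the main -[NSApplication run] loop).
    #import <Cocoa/Cocoa.h>

    static bool IsDeferrableEvent(NSEvent* aEvent)
    {
      switch ([aEvent type]) {
        case NSLeftMouseDown:
        case NSLeftMouseUp:
        case NSLeftMouseDragged:
        case NSMouseMoved:
        case NSKeyDown:
        case NSKeyUp:
          return true;   // hardware events: fine to handle later
        default:
          return false;  // everything else gets sent right away
      }
    }

    static void MaybeSendNativeEvent(NSEvent* aEvent, bool aMayWait)
    {
      if (!aEvent) {
        return;
      }
      if (aMayWait || !IsDeferrableEvent(aEvent)) {
        // This is the call that (indirectly) reaches
        // nsEventDispatcher::Dispatch() and eats the CPU described above.
        [NSApp sendEvent:aEvent];
      }
      // Otherwise do nothing here.  (A real change would have to leave the
      // event in, or put it back into, the native queue -- glossed over in
      // this sketch.)
    }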

When I say "this bug" I'm mostly referring to the "performance" issue
referred to in comment #19.  But there's a good chance this strategy
will also fix bug 962942 (which seems to happen when certain native
events, like mouse-up events, are somehow "skipped" or not processed).

I haven't been able to reproduce bug 962942, so I haven't been able to
analyze it in detail.  But notice from the first paragraph here that,
with this bug, calls to -[GeckoNSApplication sendEvent:] are more
likely to happen at different levels of recursion.  (And remember what
Markus said about ping-ponging in comment #46.)  It may be that if a
native mouse-up event is "sent" to Gecko while a native mouse-down
event is being handled by Gecko (at a different level of recursion),
it won't get handled in the right order, or at all.
Attached patch: Bandaid patch (Splinter Review)
I haven't yet given this much testing.

I'll start tryserver builds with this patch by itself, and this patch together with one that restores the patch for bug 953435 (which triggered this bug).
The combination of my bandaid patch and the restored patch for bug 953435 (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-3ec045248356/try-macosx64/firefox-31.0a1.en-US.mac.dmg) fixes this bug and also bug 953435, bug 977856 and bug 982370.

I'll use it for a day or two as my main browser to see how it fares.
Whiteboard: [STR in comment #19] → [STR in comment #19][description in comment #98 and comment #100]
I've spun my "radical simplification" patch off into bug 996848.
See Also: → 996848
Comment on attachment 8406478 [details] [diff] [review]
Bandaid patch

I tested with my bandaid-patch-plus-bug-953435 build all day and didn't see any problems.

Once this is landed I'll ask smaug to reland the patch for bug 953435.
Attachment #8406478 - Flags: review?(spohl.mozilla.bugs)
Attachment #8406478 - Flags: review?(spohl.mozilla.bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/c1a8921153d9
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Steven, can we get an uplift request for aurora (if still relevant)? Thanks
Flags: needinfo?(smichaud)
Comment on attachment 8406478 [details] [diff] [review]
Bandaid patch

Without the patch for bug 953435, this patch should make little difference.  But with it my bandaid patch will fix this bug and should also fix bug 962942.  This will allow the patch for bug 953435 to be relanded, which will fix that bug plus bug 977856 and bug 982370.

Once the patch for bug 953435 relands on trunk, it should also be uplifted to aurora.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 953435
User impact if declined: The patch for bug 953435 can't be relanded, leaving several important bugs unfixed.
Testing completed (on m-c, etc.): One day's testing as main browser by myself
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #8406478 - Flags: approval-mozilla-aurora?
Flags: needinfo?(smichaud)
Attachment #8406478 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
QA Whiteboard: [good first verify]
There's no point in verifying this until either the patch for bug 953435 relands (which hasn't happened yet) or the patch for bug 996848 lands (which obsoletes the bandaid patch for this bug and that for bug 953435).
Flags: in-qa-testsuite?
Keywords: verifyme
(In reply to Steven Michaud from comment #111)
> There's no point in verifying this until either the patch for bug 953435
> relands (which hasn't happened yet) or the patch for bug 996848 lands (which
> obsoletes the bandaid patch for this bug and that for bug 953435).

Steven, as far as I can tell, the current situation is:
- patch for bug 953435 hasn't relanded
- patch for bug 996848 landed, but in Firefox 32
As I understand it, there's no point to verify this in Firefox 31, unless something goes wrong with the patch for bug 996848 (or unless it gets uplifted to 31, which does not seem to be the case). This means we would only be verifying this in Firefox 32, as part of the fix for bug 996848.

Is my understanding correct? If yes, I would rather remove "verifyme" from this bug and set it for bug 996848, for verification there.
QA Whiteboard: [good first verify]
Flags: needinfo?(smichaud)
I think it now *does* make sense to verify this bug.  You'd be verifying that it doesn't happen in current trunk, with the patch for bug 996848 (and various followup patches) on the trunk.

Here's what I'd do.  Test with the STR from comment #19 and my copy of the original testcase at http://people.mozilla.org/~stmichaud/paperjs.org/examples/voronoi/.

1) Test with the firefox-2014-01-09-09-49-02-mozilla-central m-c nightly to see the bug.
2) Test with today's m-c nightly to (hopefully) no longer see the bug.
Flags: needinfo?(smichaud)
> As I understand it, there's no point to verify this in Firefox 31

Correct (now that I read your comment more carefully).

It only makes sense to verify this bug for branches that have the patch for bug 996848, which is 32 and up.
(In reply to Steven Michaud from comment #114)
> > As I understand it, there's no point to verify this in Firefox 31
> 
> Correct (now that I read your comment more carefully).
> 
> It only makes sense to verify this bug for branches that have the patch for
> bug 996848, which is 32 and up.

Thanks Steven, we'll verify this in Firefox 32 as part of bug 996848.
Keywords: verifyme