Closed Bug 1084021 Opened 10 years ago Closed 8 years ago

Some Famo.us app demos are janky and have super low FPS on Flame

Categories: Firefox OS Graveyard :: Performance (defect, P3)
Hardware/OS: ARM / macOS
Tracking: tracking-b2g: -
Status: RESOLVED INCOMPLETE
People: Reporter: dietrich; Assignee: Unassigned

Waiting for permission to publish my modified version of these demos for Firefox OS testing, but want to put this here to track the investigation.
Harald isolated the core of Famo.us' rendering pipeline here:

http://digitarald.github.io/openwebperf/3d-matrix-transform/direct.html

According to the FPS HUD, we're getting 3-15 FPS on the Flame device with APZ enabled, and somewhat better with APZ disabled, but never breaking 30 FPS.

Eric profiled the UltraVisual demo on a Flame device, where we're similarly getting very low FPS:

https://people.mozilla.org/~bgirard/cleopatra/#report=c15fe626eab25adc3582b8c321ba56b940fa606c
(In reply to Dietrich Ayala (:dietrich) from comment #1)
> Harald isolated the core of Famo.us' rendering pipeline here:
> 
> http://digitarald.github.io/openwebperf/3d-matrix-transform/direct.html
> 

That's just using gl-matrix, I'm not sure what it has to do with famo.us. Harald can you elaborate on this?

> Eric profiled the UltraVisual demo on a Flame device, where we're similarly
> getting very low FPS:
> 
> https://people.mozilla.org/~bgirard/cleopatra/
> #report=c15fe626eab25adc3582b8c321ba56b940fa606c

I'll try to get a few more profiles up. Generally the FPS issue is that famous' rAF callback is blowing way past its 16.6ms budget (more along the lines of 40ms) on the Flame.

A few things that are going to be inefficient:
  - Manually poking the CSS style of DOM nodes one at a time (opacity, transform, etc.)
  - Doing matrix multiplication. It's possible they might get better performance if they switch to Float32Arrays
  - Generating the matrix3d(...) transform; there's probably a more efficient way of building that string
  - There's some Fx-specific z-ordering hack that they probably don't need anymore (it sounded like it was fixed in 30)

Internally we might want to make sure that setting CSS attributes doesn't actually cause recalculations if the value didn't change.
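
For illustration only (a rough sketch, not famo.us' actual code), the Float32Array and string-building points combined: keep the matrix in a typed array and build the matrix3d(...) string with the concatenation unrolled rather than looping and joining:

  // Hypothetical 4x4 (column-major) identity matrix in a typed array.
  var m = new Float32Array([1, 0, 0, 0,
                            0, 1, 0, 0,
                            0, 0, 1, 0,
                            0, 0, 0, 1]);

  // Unrolled string building; avoids a short loop plus Array.join().
  function toMatrix3d(m) {
    return 'matrix3d(' +
      m[0]  + ',' + m[1]  + ',' + m[2]  + ',' + m[3]  + ',' +
      m[4]  + ',' + m[5]  + ',' + m[6]  + ',' + m[7]  + ',' +
      m[8]  + ',' + m[9]  + ',' + m[10] + ',' + m[11] + ',' +
      m[12] + ',' + m[13] + ',' + m[14] + ',' + m[15] + ')';
  }

  var element = document.getElementById('surface'); // hypothetical node
  element.style.transform = toMatrix3d(m);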
Flags: needinfo?(hkirschner)
> http://digitarald.github.io/openwebperf/3d-matrix-transform/direct.html
This isolates the performance bottleneck of setting matrix3d(), which is the longest CSS value list and which therefore spends a lot of time just parsing values. gl-matrix is super efficient and doesn't even make a blip in our sampling profiler.

> - Manually poking CSS style of dom nodes one at a time (opacity, transform, etc)
What is poking CSS?

> - Doing matrix multiplication. It's possible if they switch to Float32Arrays they might get better performance
I recommended that before, but again the matrix calculations don't have a big impact in profiles.

> - There's some Fx specific z-ordering hack that they probably don't need anymore (sounded like it was fixed in 30)
Agreed, that should have a big impact as z-index might not get the same GPU optimizations that transform and opacity get.

> - Generating the matrix3d(...) transform, there's probably a more efficient way of building that string
They should unroll that loop, yes. It is apparently not unrolled by the JIT in FF as the loop length is too short.
Flags: needinfo?(hkirschner) → needinfo?(dietrich)
A more useful profile (although this did have a couple of local changes): https://people.mozilla.org/~bgirard/cleopatra/#report=91ba041d18a1f1cbf02dd445defd8f823ec19891

This was generated with the following command:
  |./profile.sh start -i 10 -e 200000 -p b2g -t Compositor && ./profile.sh start -i 10 -e 200000 -p Ultravisual && sleep 5 && ./profile.sh capture|

High level we see:
 - 38.5% - js::RunScript  (this is essentially famous' run loop)
 - 21.1% - PresShell::Paint
 - 18.9% - PresShell::Flush (Flush_Style)
 -  7.4% - PresShell::Flush (Flush_InterruptibleLayout)
 -  7.0% - IPDL::PCompositor::RecvDidComposite
 -  4.3% - IPDL::PBrowser::RecvAsyncMessage (I think this was the remote debugger)
(In reply to Harald Kirschner :digitarald from comment #3)
> > http://digitarald.github.io/openwebperf/3d-matrix-transform/direct.html
> This isolates the performance bottleneck of setting a matrix3d(), which is
> the longest CSS value list and which therefor spends a lot of time just
> parsing values. gl-matrix is super efficient doesn't even make a blip in our
> sampling profiler.

I see, so your hypothesis is that just setting matrix3d is enough to get a poor framerate. I don't disagree; I'm just saying the famo.us runloop is taking > 16.6ms, so we've already lost.

> > - Manually poking CSS style of dom nodes one at a time (opacity, transform, etc)
> What is poking CSS?

Just in ElementOutput.js:
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L188
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L212
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L220
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L221
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L258
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L263
  * https://github.com/Famous/famous/blob/master/src/core/ElementOutput.js#L291

The engine walks a tree of nodes and updates them one at a time, so that could be a fair amount of overhead.
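
Roughly what I mean, as a simplified hypothetical sketch (not the actual ElementOutput.js code): each node visited during the walk writes its style properties individually:

  // Hypothetical render-tree walk; every visited node writes several
  // inline style properties directly, one at a time, every frame.
  function commitNode(node) {
    var el = node.element;                // the DOM element for this node
    el.style.opacity = node.opacity;      // one style write...
    el.style.transform = node.transform;  // ...and another, per node
    for (var i = 0; i < node.children.length; i++) {
      commitNode(node.children[i]);
    }
  }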

> > - Doing matrix multiplication. It's possible if they switch to Float32Arrays they might get better performance
> I recommended that before, but again the matrix calculations don't have a
> big impact in profiles.

FWIW it showed up in a few of the profiles I looked at.
Clearing ni? as Eric answered the questions.
Flags: needinfo?(dietrich)
Just to get on the same page:
The change that we propose is to queue up all their CSS writes and flush them at once?

I'd expect the engine to flush all CSS after blocking JS. The code does not have any CSS reads or other DOM access that would trigger flushing. Is that not expected behavior?
Flags: needinfo?(erahm)
I'm not sure what the right way to implement things is; you might want to ask someone working on CSS. My understanding is that messing with CSS attributes can cause reflows etc., so doing them one at a time like that could cause multiple reflows. Or I could be wrong. Looking at the profiles I referenced above, it does seem like we do some JS, get blocked while some CSS-internal work goes on, and then come back to processing.
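
For what it's worth, the classic pattern that forces extra synchronous flushes is interleaving layout reads with style writes; a hypothetical sketch (selector and values are made up):

  var els = Array.prototype.slice.call(document.querySelectorAll('.surface'));
  var transforms = els.map(function () { return 'translate3d(10px, 0, 0)'; });

  // Writes alone are normally coalesced and flushed once after the callback:
  els.forEach(function (el, i) {
    el.style.transform = transforms[i];  // write only: style just flagged dirty
  });

  // Interleaving a layout read forces an immediate style/layout flush
  // on every iteration:
  els.forEach(function (el, i) {
    el.style.transform = transforms[i];  // write
    var w = el.offsetWidth;              // read: forces a synchronous flush
  });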
Flags: needinfo?(erahm)
Eric: The profile shows up for me without any JavaScript: https://cloudup.com/cxofCNQfB9Y . If famo.us would ask me I couldn't tell them the timings for the bottlenecks.

ni? Benoit as the CSS/graphics expert: can you confirm or deny the assumption above? Would queuing up and then flushing all CSS changes in JS really make a big difference vs. making CSS changes during a heavier loop? Other factors are that the code never reads CSS and does all changes on elements that should be on the GPU due to CSS3 3D transforms.
Flags: needinfo?(erahm)
Flags: needinfo?(bgirard)
(In reply to Harald Kirschner :digitarald from comment #9)
> Eric: The profile shows up for me without any JavaScript:
> https://cloudup.com/cxofCNQfB9Y . If famo.us would ask me I couldn't tell
> them the timings for the bottlenecks.

I've been looking at the UltraVisual process, click on the middle bar.
Flags: needinfo?(erahm)
(In reply to Harald Kirschner :digitarald from comment #9)
> Eric: The profile shows up for me without any JavaScript:
> https://cloudup.com/cxofCNQfB9Y . If famo.us would ask me I couldn't tell
> them the timings for the bottlenecks.

You're looking at the compositor timeline there. There's JS execution, but it's on the main thread; you need to select that timeline. From the 'Frames' overview, the requestAnimationFrame callback is taking a considerable amount of execution time.

> 
> ni? Benoit as the CSS/graphics expert: Can you confirm or deny the
> assumption above? Would queuing up and then flushing all CSS changes in JS
> really make a big difference vs CSS changes during a heavier loop.

When CSS changes are made, the styles are flagged as being out of date. When you return from JavaScript, the browser may choose to repaint, which will require flushing and recalculating any pending CSS/style changes. Ideally all your CSS changes should be done in the same event or at the end of your requestAnimationFrame (rAF) callback. After the browser calls rAF it will update CSS styles, reflow the page and repaint. So ideally all the changes you wanted to make are done before the end of rAF; otherwise we're going to need to keep recalculating styles and repainting the changes.

There are of course exceptions: if you're going to make a large change then you'll need to break it over multiple events to keep the page responsive.
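
A hypothetical sketch of that last point, spreading a large batch of pending style updates across several rAF callbacks (names and chunk size are made up):

  // Apply a big list of pending style updates in slices, one slice per
  // animation frame, so no single callback blows the frame budget.
  function applyInChunks(updates, chunkSize) {
    var i = 0;
    function step() {
      var end = Math.min(i + chunkSize, updates.length);
      for (; i < end; i++) {
        updates[i].element.style.transform = updates[i].transform;
      }
      if (i < updates.length) {
        requestAnimationFrame(step);
      }
    }
    requestAnimationFrame(step);
  }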
Flags: needinfo?(bgirard)
> Ideally all your CSS changes should be done in the same event or at the end of your requestAnimationFrame (rAF) callback.

famo.us does that as there are no pauses in JS execution between setting styles.
Eric,

https://cloudup.com/cHQ79YGH5ib . I see 20% of the time being spent in the last commit(). In commit() I see only native calls related to CSS parsing. This would even go up to 80% and more as the profile picks up the recursive calls in commit() as new branches. This confirms bug 1055932 for me.

I also could not see Benoit confirming your recommendation to queue all CSS updates, given that they are already made within the current rAF callback without pauses.

dietrich on ni? to add his 2c on this.
Flags: needinfo?(erahm)
Flags: needinfo?(dietrich)
Forgot to go back to the initial argument from a previous comment:

> I'm just saying the famo.us runloop is taking > 16.6ms, so we've already lost.

CSS value parsing is synchronous right after a value is set (see the parsing-related calls in platform code in commit()), which is why the runloop explodes.
(In reply to Harald Kirschner :digitarald from comment #14)
> Forgot to go back to the initial argument from a previous comment:
> 
> > I'm just saying the famo.us runloop is taking > 16.6ms, so we've already lost.
> 
> CSS value parsing is synchronous right after a value is set (see the parsing-
> related calls in platform code in commit()), which is why the runloop explodes.

The JS portion is definitely taking > 16.6ms (not counting CSS reflows etc.).

Zoom in on a smaller sample set by selecting a range in the "Gecko Main" timeline, then click the "Sample Range [...,...]" entry. From there you can see in the top bar how much time was spent on scripts (JS), how much was spent on reflow, etc. Hover your mouse over the orange block and it will tell you how much time was spent in scripts. A cursory examination shows the scripts are all taking > 16.6ms.
Flags: needinfo?(erahm)
> The JS portion is definitely taking > 16.6ms (not counting CSS reflows etc.).

I am not arguing that reflows are the factor, but that the time accumulated in JS also includes *CSS parsing* (read: not reflow, but parsing). This is *not only JS* but also time spent in platform code due to JS calls.

Apart from the slow CSS parsing, one bug for famo.us to look at is why this demo triggers reflows at all.
Flags: needinfo?(erahm)
(In reply to Harald Kirschner :digitarald from comment #16)
> > The JS portion is definitely taking > 16.6ms (not counting CSS reflows etc.).
> 
> I am not arguing that reflows are the factor, but that the time accumulated
> in JS also includes *CSS parsing* (read: not reflow, but parsing). This is
> *not only JS* but also time spent in platform code due to JS calls.
> 
> Apart from the slow CSS parsing, one bug for famo.us to look at is why this
> demo triggers reflows at all.

Yes, there are many ways JS code ends up passing through to C++. I suppose it's possible there's a more efficient way to update the CSS attributes of a bunch of nodes (regardless of reflows), but I'm not sure what it is. I think for some of it they could have predefined classes and use those, which might avoid additional parsing.

I agree avoiding reflows is a good thing! I think unfortunately they're poking enough CSS that it's unavoidable.
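
As a rough illustration of the predefined-classes idea (hypothetical class names, not something famo.us does today): values that only ever take a few known states can be toggled via classes the style system has already parsed, instead of rewriting an inline style string each frame:

  /* In a stylesheet, parsed once:
       .surface-hidden  { opacity: 0; }
       .surface-visible { opacity: 1; }
  */
  function setVisible(el, visible) {
    if (visible) {
      el.classList.add('surface-visible');
      el.classList.remove('surface-hidden');
    } else {
      el.classList.add('surface-hidden');
      el.classList.remove('surface-visible');
    }
  }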
Flags: needinfo?(erahm)
Hi all,

Thanks for helping with some of the issues we are experiencing.

Points you have brought up that we already have / are working on implementing in our newer version:

- Batched CSS writes.  Surfaces queue style changes and flush them right at the end of the frame.

- Float32 array transforms.  Also implementing caching to not redo matrix multiplication for nodes that have not changed.

- Flattening of the stringification of the matrix3d property.  We used to have this, but perf tests from a few months ago showed that we ran faster on Chrome and Safari when looped.  We are going back as this no longer seems to be the case.

- Skipping rewriting of styles/transforms that do not change (a rough sketch of this check follows after this list). https://github.com/digitarald/openwebperf/issues/3
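
A minimal sketch of that last point, assuming a per-node cache of the last transform string written (field and function names are made up):

  // Remember the last transform string committed for a node and skip the
  // DOM write (and the resulting CSS re-parse) when it hasn't changed.
  function commitTransform(node, matrixString) {
    if (node._lastTransform === matrixString) {
      return;                                    // unchanged: no write
    }
    node._lastTransform = matrixString;
    node.element.style.transform = matrixString;
  }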

Questions:

- If all Surfaces are positioned absolutely with the matrix3d property, shouldn't their style changes not cause reflows?

- It was suggested that, since we are manually writing the transforms, we apply the perspective ourselves.  Is this faster than letting the browser handle this for us?  We currently have to do this anyway to get the homogeneous coordinate system for WebGL, so it should not be hard to do the same with DOM, barring unforeseen quirks in the way the browser applies perspective. https://github.com/digitarald/openwebperf/issues/1

Issues:

- We are having antialiasing issues.  Example here http://michaelobriena.github.io/firefox/antialiasing/

- DOM Elements do not intersect as expected.  http://michaelobriena.github.io/firefox/intersection/

- Image rotation ends up playing with the drawn z-depth or scale.  This used to be a Nightly issue but was introduced in Firefox 34. http://michaelobriena.github.io/firefox/imageRotation/

Comments:

- We are currently animating with .999999 opacity on the root HTML node because of a bug in iOS 8 Safari's rendering.  This should go away once that is fixed. https://github.com/digitarald/openwebperf/issues/2

- We round to 6 decimals in the transform because, based on the perspective, we found that anything less was not enough to properly align DOM elements.  It looks like this has a large performance impact, but I don't know how much wiggle room we have with that number.  I will test more to see how big of an issue the alignment is for us.  https://github.com/digitarald/openwebperf/blob/gh-pages/3d-matrix-transform/rounding.js


The image rotation bug is our big worry right now as it makes Famo.us somewhat unusable on Firefox.  Thanks again for your help.
ni? on BenWa to provide some input on Mike's questions. I will also file follow up bugs for the test cases.
Flags: needinfo?(bgirard)
There's a lot of interesting discussion here and potential platform bugs. It's difficult to make progress on a particular issue when one bug tracks many issues, so I suggest that anything important and deemed to be a platform bug be split off into its own bug.

(In reply to Mike O'Brien from comment #18)
> Hi all,
> 
> Thanks for helping with some of the issues we are experiencing.
> 
> Points you have brought up that we are already have / are working on
> implementing in our newer version:
> 
> - Batched CSS writes.  Surfaces queue style changes and right at the end of
> the frame.

I'm not sure if that's required. As long as you make all your changes in the same event and don't perform an operation/query that requires up-to-date style information, you shouldn't need to defer your CSS changes. If you get this wrong, you'll see a style flush outside of the RefreshDriver in the profiles, so it should be noticeable with a CPU profiler.

> 
> - Float32 array transforms.  Also implementing caching to not redo matrix
> multiplication for nodes that have not changed.
> 
> - Flattening of the stringification of the matrix3d property.  We used to
> have this but perfs from a few months ago showed that we ran faster on
> Chrome and Safari when looped.  We are going back as this seems to not be
> the case.
> 
> - Skipping rewriting of styles/transforms that do not change.
> https://github.com/digitarald/openwebperf/issues/3
> 

If you find this causes an improvement then we should spin off a bug to investigate it. I'd imagine we'd already handle this fairly well; if so, we will check with the layout team.

> Questions:
> 
> - If all Surfaces are positioned absolutely with the matrix3d property,
> shouldn't their style changes not cause reflows?
> 

I believe it will still cause a restyle/reflow, but the work shouldn't need to restyle/reflow the whole document, only what changed. If the restyle/reflow performed for these changes is too expensive then we should investigate that in its own bug.

> - It was suggested that since we are manually writing the transforms, to
> apply the perspective ourselves.  Is this faster than letting the browser
> handle this for us?  We currently have to do this anyways to get the
> homogenous coordinate system for WebGL so it should not be hard to do the
> same with DOM, barring unforseen quirks in the way the browser applies
> perspective. https://github.com/digitarald/openwebperf/issues/1
> 

I would imagine not, but let's ask Matt Woodrow, who implemented this.

> Issues:
> 
> - We are having antialiasing issues.  Example here
> http://michaelobriena.github.io/firefox/antialiasing/
> 
> - DOM Elements do not intersect as expected. 
> http://michaelobriena.github.io/firefox/intersection/

The above two issues are being worked on by :kip, I'll defer to him.

> 
> - Image rotation ends up playing with the drawn z-depth or scale.  This used
> to be a Nightly issue but was introduced in Firefox 34.
> http://michaelobriena.github.io/firefox/imageRotation/

Yuk, do we not already have a bug on file for this one? If not let's get one ASAP. This might be something we want to get to the release channel as fast as reasonably possible.

> 
> Comments:
> 
> - We currently are animating with .999999 opacity on the root HTML node
> because of a bug in ios8 Safari's rendering.  This should go away once that
> is fixed. https://github.com/digitarald/openwebperf/issues/2

It's best to avoid that; you risk missing some opaque fast paths.
> 
> - We round to 6 decimals in the transform because based on the perspective
> we found that  anything less was not enough to properly align DOM elements. 
> It looks like this has a large performance impact but I don't know how much
> wiggle room we have with that number.  I will test more to see how big of an
> issue the alignment is for us. 
> https://github.com/digitarald/openwebperf/blob/gh-pages/3d-matrix-transform/
> rounding.js
> 

AFAIK the number of decimal places won't make a difference. Our fast paths require things like integer scale/translation and axis alignment. Having 2, 3 or 6 decimal places will all be sufficient to throw it off the fast paths.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(kgilbert)
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #20)
> 
> > - It was suggested that since we are manually writing the transforms, to
> > apply the perspective ourselves.  Is this faster than letting the browser
> > handle this for us?  We currently have to do this anyways to get the
> > homogenous coordinate system for WebGL so it should not be hard to do the
> > same with DOM, barring unforseen quirks in the way the browser applies
> > perspective. https://github.com/digitarald/openwebperf/issues/1
> > 
> 
> I would imagine no but let's ask Matt Woodrow who implemented this.

This really shouldn't make any noticeable difference.
Flags: needinfo?(matt.woodrow)
> > - We are having antialiasing issues.  Example here
> > http://michaelobriena.github.io/firefox/antialiasing/
> > 
> > - DOM Elements do not intersect as expected. 
> > http://michaelobriena.github.io/firefox/intersection/
> 
> The above two issues are being worked on by :kip, I'll defer to him.
> 

The anti-aliasing issues will be corrected with Bug 766345.  This will be implemented with a distance-to-edge shader and will not depend on hardware AA support or a depth buffer.

The intersection of DOM elements will be corrected with Bug 689498.  As there is no depth buffer and DOM elements may contain transparency, this will be implemented with a plane-splitting algorithm.  Depth buffers for hidden-surface removal would enable higher GPU utilization for fully opaque geometry while introducing some limitations; this may also be worth exploring once the general case has been solved with plane-splitting.

I am expecting to implement both Bug 766345 and Bug 689498.  Please NI me on those bugs for more details if you would like.
Flags: needinfo?(kgilbert)
Flags: needinfo?(dietrich)
tracking-b2g: --- → -
Priority: -- → P3
Hardware: x86 → ARM
Do you think this bug still needs to be kept open?
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE