Closed Bug 898556 Opened 11 years ago Closed 7 years ago

Triple-check GC tuning on FxOS (was Firefox OS Octane regression on nightly Nexus 4)

Categories

(Core :: JavaScript: GC, defect, P3)

ARM
Gonk (Firefox OS)
defect

Tracking

RESOLVED INVALID
blocking-b2g -

People

(Reporter: vlad, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [c=benchmark p= s= u=])

On a Nexus 4, Octane scores:

Android:      Firefox 22: 1616
            Nightly (25): 2163

Firefox OS: Nightly (25): 1125

This looks like something flat-out bad is going on; I would expect Octane numbers to be identical between Android and Firefox OS.

(There's a nexus 4 build config ("mako") as part of stock b2g now.)

Here are the individual test numbers, both for nightly; Android first, FxOS second.

Richards:    3703  3834
Deltablue:   1958   874 (!)
Crypto:      3150  2756
Raytrace:    2367   542 (!)
EarleyBoyer: 2669  1020 (!)
Regexp:       362   286
Splay:       1874  1076 (!)
NavierStokes:5634  5617
pdf.js:      1306   731 (!)
Mandreel:    2051  2133
GB Emu:      3701  1794 (!)
CodeLoad:    2841  2126 (!)
Box2DWeb:    1458  1650

It looks like something is going seriously wrong with JS perf when built in a b2g config.
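A minimal sketch of how the "(!)" marks above could be reproduced mechanically. The scores are copied from the table; the 0.75 cutoff and the names `SCORES`, `REGRESSION_RATIO` and `regressions` are assumptions made for this illustration, not something defined in this bug.

```python
# Scores copied from the table above: (Android nightly, FxOS nightly).
SCORES = {
    "Richards": (3703, 3834),
    "Deltablue": (1958, 874),
    "Crypto": (3150, 2756),
    "Raytrace": (2367, 542),
    "EarleyBoyer": (2669, 1020),
    "Regexp": (362, 286),
    "Splay": (1874, 1076),
    "NavierStokes": (5634, 5617),
    "pdf.js": (1306, 731),
    "Mandreel": (2051, 2133),
    "GB Emu": (3701, 1794),
    "CodeLoad": (2841, 2126),
    "Box2DWeb": (1458, 1650),
}

# Assumed cutoff: flag a test when FxOS scores below 75% of Android.
REGRESSION_RATIO = 0.75


def regressions(scores, cutoff=REGRESSION_RATIO):
    """Return {benchmark: FxOS/Android ratio} for tests below the cutoff."""
    return {
        name: fxos / android
        for name, (android, fxos) in scores.items()
        if fxos / android < cutoff
    }


if __name__ == "__main__":
    # Print the worst ratios first.
    for name, ratio in sorted(regressions(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:12s} {ratio:.2f}")
```

With this cutoff the flagged set matches the tests marked "(!)" in the table; a different cutoff would give a different list.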
Well, so for a start, the GC prefs are different on FxOS and Fennec.  Specifically:

1)  Fennec has pref("javascript.options.gc_on_memory_pressure", false).
2)  FxOS changes the mem.gc_incremental_slice_ms from 10ms to 30ms.
3)  FxOS changes various GC frequency parameters, which can change GC timing.
4)  FxOS sets a high water mark of 6MB, whereas Fennec uses 32MB (or maybe 16MB on low-memory
    devices).
5)  FxOS sets the allocation threshold to 1MB (the default value, as used on Fennec, is
    20MB, except in said low-memory device config, where it's 3MB).

So my money is on FxOS doing a lot more GC during benchmarks...
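To illustrate why a lower allocation threshold could mean many more collections, here is a toy model. It assumes a GC fires every time the threshold's worth of memory has been allocated since the previous GC, which is a simplification and not how the real engine schedules collections. The 60MB workload and the function name `count_gcs` are hypothetical; the three thresholds are the ones listed above.

```python
def count_gcs(total_alloc_kb, threshold_kb, chunk_kb=64):
    """Count GCs in a toy model: collect once `threshold_kb` has been
    allocated since the previous collection."""
    gcs = 0
    since_last_gc = 0
    for _ in range(total_alloc_kb // chunk_kb):
        since_last_gc += chunk_kb
        if since_last_gc >= threshold_kb:
            gcs += 1
            since_last_gc = 0
    return gcs


MB = 1024  # kilobytes per megabyte
WORKLOAD_KB = 60 * MB  # hypothetical benchmark allocating 60MB in total

if __name__ == "__main__":
    for label, threshold_mb in [
        ("FxOS (1MB)", 1),
        ("Fennec low-memory (3MB)", 3),
        ("Fennec default (20MB)", 20),
    ]:
        print(label, count_gcs(WORKLOAD_KB, threshold_mb * MB))
```

In this model the collection count scales inversely with the threshold, so the 1MB setting collects 20 times as often as the 20MB one for the same workload.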
Let me update prefs and see what happens!
Okay, definitely GC.  If I take the low memory settings from Fennec (with a few changes, since I think I screwed up; 32mb high water mark instead of 16mb and 3mb alloc threshold), the overall result drops to 635. Some benchmarks like Deltablue drop down to 129.

Here are the numbers again, with the third column being FxOS with the same JS GC prefs as Firefox for Android; mostly in line, with 2 significant drops and one surprising win.

Overall:     2163  1125  2109
Richards:    3703  3834  3689
Deltablue:   1958   874  1831
Crypto:      3150  2756  2908
Raytrace:    2367   542  1945 (still significant drop)
EarleyBoyer: 2669  1020  3062
Regexp:       362   286   389
Splay:       1874  1076  1259 (still significant drop)
NavierStokes:5634  5617  5606
pdf.js:      1306   731  1366
Mandreel:    2051  2133  2141
GB Emu:      3701  1794  3525
CodeLoad:    2841  2126  2751
Box2DWeb:    1458  1650  1792 (a little surprising win?)

Not sure where this leaves us for B2G.
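As a sanity check on the table above, the overall score can be recomputed from the individual tests, assuming Octane's overall figure is the geometric mean of the per-test scores. The lists are the first (Android) and third (FxOS with Fennec prefs) columns; the variable and function names are made up for this sketch.

```python
import math

# Per-test scores in table order, Richards through Box2DWeb.
ANDROID = [3703, 1958, 3150, 2367, 2669, 362, 1874,
           5634, 1306, 2051, 3701, 2841, 1458]
FXOS_FENNEC_PREFS = [3689, 1831, 2908, 1945, 3062, 389, 1259,
                     5606, 1366, 2141, 3525, 2751, 1792]


def geometric_mean(scores):
    """Geometric mean computed via logs to avoid a huge intermediate product."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


if __name__ == "__main__":
    print("Android:", round(geometric_mean(ANDROID)))
    print("FxOS with Fennec prefs:", round(geometric_mean(FXOS_FENNEC_PREFS)))
```

If the assumption holds, both results should land close to the reported overall scores of 2163 and 2109.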
Summary: Firefox OS Octane regression on nightly Nexus 4 → Triple-check GC tuning on FxOS (was Firefox OS Octane regression on nightly Nexus 4)
One interesting question is whether the tuning should be different for different FxOS processes.  I can maybe see us wanting less aggressive GC in the browser process than elsewhere, maybe.
The GC settings were tuned on an Unagi, which, as far as I know, is closer to our target market (Bug 863398).  The changes made at the time are still visible on AWFY [1].  Can you check whether reverting these settings locally improves the benchmarks on the Nexus 4?

One of the constraints on the choice of these GC settings was that we at least get a chance to GC before we run out of memory.  As Bug 863398 comment 21 details, this limit is low because of our target market.

[1] http://arewefastyet.com/#machine=14

(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #3)
> Raytrace:    2367   542  1945 (still significant drop)
> Box2DWeb:    1458  1650  1792 (a little surprising win?)

These can be caused by a shift in the scheduling of GCs.  This happens a lot on the Unagi running AWFY.  Have a look at the breakdown of Octane.

> Splay:       1874  1076  1259 (still significant drop)

This benchmark had a lot of GC noise on its own when GCs were disabled.  Can you double-check this one on both platforms?
Keywords: perf
Whiteboard: [c= ]
Cc'ing a few folks for visibility -- there's a lot of talk about Octane in marketing, and right now running Octane on FxOS will give a significantly worse result than Android no matter how low or high end the device is.

nbp, reverting the settings on FxOS did make a difference -- the perf recovered, that's what the third column of numbers is in comment #3
The Nexus 4 has 2gb of RAM.  The most powerful device we're shipping B2G on today has 512mb of RAM.

I don't think it's a useful exercise to tune B2G to run on a device with 4x as much memory as the highest-end device we're shipping on.  Can we instead compare B2G to Android on 256mb and 512mb devices and tune there?
For comparison, we don't even support Android Firefox with less than 384mb, 50% more than the FxOS minimum.  Furthermore, on FxOS with 256mb, we have like 100mb available for apps, so adding another 128mb is really more than doubling the amount of available RAM for the benchmark.  (I'm not sure how much memory is available to FxOS on Android.)
Can we set the GC params based on the size of the device's memory?  Doing this in an ad hoc manner sounds bad; are there more cases like this where we'd like to set system parameters based on device resources?  (Bug 892097 is a recent example I've seen although not compelling enough on its own.)
> Can we set the GC params based on the size of the device's memory?

I don't want to beg the question as to whether or not that's necessary, but I don't think there's anything standing in our way from doing this.  But again, I think we should be focusing on device configurations that we actually ship on, so the question should be whether 256mb B2G devices need different params from 512mb B2G devices, noting that those may need different params from 256mb/512mb Android.

wrt comparing between Android and B2G, note also that B2G has multiple Gecko processes running at the same time, and this necessitates being more conservative with the max allowable JS heap size, on a per-process basis.
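A rough sketch of what choosing GC parameters from device memory might look like. The two profiles reuse values quoted earlier in this bug (6MB/1MB for FxOS, 32MB/20MB for the Fennec default); the 512MB cutoff, the profile names and `select_gc_profile` are all hypothetical, and real values would have to come from tuning on shipping devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GCProfile:
    name: str
    high_water_mark_mb: int
    allocation_threshold_mb: int


# Values quoted earlier in this bug; the profile names are invented here.
FXOS_CURRENT = GCProfile("fxos-current", 6, 1)
FENNEC_DEFAULT = GCProfile("fennec-default", 32, 20)

# Hypothetical cutoff: 512MB is the largest shipping B2G device mentioned.
LOW_MEMORY_LIMIT_MB = 512


def select_gc_profile(total_ram_mb):
    """Pick a GC profile from total device RAM (illustrative policy only)."""
    if total_ram_mb <= LOW_MEMORY_LIMIT_MB:
        return FXOS_CURRENT
    return FENNEC_DEFAULT


if __name__ == "__main__":
    for ram_mb in (256, 512, 2048):
        print(ram_mb, select_gc_profile(ram_mb))
```

A single RAM cutoff ignores the multi-process point made above; a per-process budget would probably need to be part of any real policy.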
I'm not suggesting that we tune for the Nexus 4, or any other high-end device.  I was going to make a suggestion similar to Luke's.

I think the core issue is this: right now, if you compare FxOS to Firefox on Android on Octane, on identical/similar devices, FxOS will look significantly worse.  Many of our partners are doing exactly this, and they're looking at next gen FxOS devices.
Okay, it sounds like we're all in agreement.  Just one more piece of clarification:

> I would expect Octane numbers to be identical between Android and Firefox OS.

I'm not sure we should expect this.  I don't know whether Android or the FFOS browser has more RAM available to it, on a device with Xmb of RAM, but that could reasonably affect the scores or how we tune the GC.
Whiteboard: [c= ] → [MemShrink] [c= ]
Our boot image for the Nexus 4 throttles the device down ( Nexus4-HW-Shrink-Helix-like-boot.img )

  1. CPU Cores: 2
  2. CPU Freq.: 1 GHz
  3. GPU Freq.: 320 MHz
  4. Framebuffer: 480 * 800
  5. Memory: 512 MB

Original Nexus 4 Spec:
  1. CPU Cores: 4
  2. CPU Freq.: 1.7 GHz
  3. GPU Freq.: 400 MHz
  4. Framebuffer: 720 * 1280
  5. Memory: 2 GB

This doesn't appear to be a Memshrink issue.
Whiteboard: [MemShrink] [c= ] → [c= ]
Bumping this to ? again -- we're shipping 1.4 on newer/faster/more modern devices, and the GC tuning can make a huge difference in performance.  Let's not be shooting ourselves in the foot.
blocking-b2g: --- → 1.4?
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #14)
> Bumping this to ? again -- we're shipping 1.4 on newer/faster/more modern
> devices, and the GC tuning can make a huge difference in performance.  Let's
> not be shooting ourselves in the foot.

Does this accurately describe tarako?  That seems like a more constrained device.  I wonder if we need multiple sets of tuning parameters somehow based on the hardware.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #14)
> Bumping this to ? again -- we're shipping 1.4 on newer/faster/more modern
> devices, and the GC tuning can make a huge difference in performance.  Let's
> not be shooting ourselves in the foot.

I already added a function which currently distinguishes between desktop and mobile, so that we can emulate the GC settings of a phone in the JS shell.  It is just a matter of plumbing to add a preference which sets the GC settings based on the memory available on the device.

(In reply to Ben Kelly [:bkelly] from comment #15)
> I wonder if we need multiple sets of tuning parameters somehow
> based on the hardware.

As I discussed with Vlad, this is not the ideal solution, but this is an easy target as soon as we can make a good benchmark for looking at GC pauses.  The benchmark I made for tweaking the results for the Unagi was extremely noisy and this made everything hard to tune.

I am sure there are more variables in the problem than the memory available on the device, such as memory latency, but we should gather many more devices before investigating these other variables.
Assignee: general → nicolas.b.pierron
Flags: needinfo?(nicolas.b.pierron)
Do we need this for the tarako memory work?  It seems like there is work here to use the benchmark for the Unagi tweaking on the Tarako for the same purpose.
Flags: needinfo?(jcheng)
Flags: needinfo?(bkelly)
I imagine re-evaluating our tuning for a device with half the memory of our other devices would be a good idea.

I think the open question, though, is how to support tarako and the larger devices with a single code base.  If I understand correctly, Nicolas is working on this.  He'll need to indicate if it's possible in the short term for tarako.
Flags: needinfo?(bkelly)
The GC parameters really need to be tuned per-device (or at least tuned separately for each amount of memory) for best performance.
(In reply to Dave Huseby [:huseby] from comment #17)
> Do we need this for the tarako memory work?  It seems like there is work
> here to use the benchmark for the Unagi tweaking on the Tarako for the same
> purpose.

All I need for tuning for a device is:
 - A device.
 - Good benchmarks.
 - A month of benchmarking (to plot this 7-dimensional space).
 - A week to understand and refine the previous search.

At the moment, I have none of these.
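To give a feel for why exploring a 7-dimensional parameter space takes so long, here is a back-of-the-envelope sketch of an exhaustive grid search. The number of candidate values per parameter and the minutes per benchmark run are hypothetical inputs, not figures from this bug.

```python
DIMENSIONS = 7  # number of GC parameters, as mentioned above


def grid_size(values_per_dimension, dimensions=DIMENSIONS):
    """Number of configurations in an exhaustive grid search."""
    return values_per_dimension ** dimensions


def days_needed(values_per_dimension, minutes_per_run, dimensions=DIMENSIONS):
    """Wall-clock days to run every configuration once, back to back."""
    total_minutes = grid_size(values_per_dimension, dimensions) * minutes_per_run
    return total_minutes / (60 * 24)


if __name__ == "__main__":
    # Hypothetical inputs: 3 candidate values per parameter, 20 minutes per run.
    print(grid_size(3), "configurations")
    print(f"{days_needed(3, 20):.1f} days")
```

Even three values per parameter gives 2187 configurations, which at a hypothetical 20 minutes per run is about 30 days of continuous benchmarking on one device.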
Mike, what can we do to get Nicolas a tarako device?
Flags: needinfo?(mlee)
Nicolas, what benchmarks did you use when tuning for the unagi?  Are those not applicable for other devices?  Maybe I'm not sure what you mean by "benchmark".
I used Octane, and a modified version of the incremental GC test made by Bill, which I called snappy [1].  Snappy was not really useful as it was extremely noisy, and recently Octane has been becoming noisy too.  These benchmarks should be refined to ensure that there is as little noise as possible.

[1] http://people.mozilla.org/~npierron/snappy-bench/
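One way to put a number on "noisy" is the coefficient of variation across repeated runs of the same configuration. This is only a sketch: the 5% limit and the function names are assumptions, and nothing in this bug says this is how the noise was actually judged.

```python
import statistics


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean of repeated scores."""
    return statistics.stdev(scores) / statistics.mean(scores)


def is_too_noisy(scores, max_cv=0.05):
    """Assumed rule of thumb: over 5% relative spread is too noisy to tune on."""
    return coefficient_of_variation(scores) > max_cv


if __name__ == "__main__":
    # Made-up scores from repeated runs of one benchmark configuration.
    runs = [1076, 1259, 1101, 1190]
    print(f"cv = {coefficient_of_variation(runs):.3f}, noisy = {is_too_noisy(runs)}")
```

If the spread between repeated runs is larger than the difference between two candidate settings, the comparison between those settings says very little.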
Joe, can we get Nicolas a Tarako device?
Status: NEW → ASSIGNED
Flags: needinfo?(mlee)
Whiteboard: [c= ] → [c=benchmark p= s= u=]
1.3T? to discuss this during Tarako triage
blocking-b2g: 1.4? → 1.3T?
Flags: needinfo?(jcheng)
triage: not going to be in time for tarako timeframe. minus
blocking-b2g: 1.3T? → -
Depends on: 1216286
Assignee: nicolas.b.pierron → nobody
Status: ASSIGNED → NEW
Component: JavaScript Engine → JavaScript: GC
Flags: needinfo?(nicolas.b.pierron)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INVALID