Closed Bug 1014649 Opened 7 years ago Closed 7 years ago

goog.math.Long is extremely slow on FxOS

Categories

(Core :: JavaScript Engine, defect, P3)

23 Branch
x86
macOS
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: Harald, Unassigned)

References

()

Details

(Keywords: perf, Whiteboard: [c= p= s= u=])

Test case is extracted from Webogram, the web version of Telegram, and does some crypto math that depends on 64-bit numbers using goog.math.Long: https://github.com/digitarald/webogram-issue-40

On Desktop (MBP) this test finishes in less than one second. On FxOS mid-end device (Flame) this takes 30-200s. This seams unreasonably slow even persists across FxOS 1.1 to 1.4.

I could not record any useful profiles that would show where time is spend. You can install the app on a phone using App Manager and the manifest from digitarald.github.io/webogram-issue-40/manifest.webapp
I was perusing the source and came across a potential point of interest:

There's a section in the google.math.Long code that reads like follows:

// NOTE: the compiler should inline these constant values below and then remove
// these variables, so there should be no runtime penalty for these.
goog.math.Long.TWO_PWR_16_DBL_ = 1 << 16;
goog.math.Long.TWO_PWR_24_DBL_ = 1 << 24;
goog.math.Long.TWO_PWR_32_DBL_ =
  goog.math.Long.TWO_PWR_16_DBL_ * goog.math.Long.TWO_PWR_16_DBL_;
...


Now, I think V8 and JSC will optimize away any accesses to these properties and constantize them.  Spidermonkey currently doesn't do that.  Since they are integers and doubles, TI will capture their type, but all the code referencing them will still read those values from the object slots.  This is probably compounded by our imprecise alias analysis, which would cause us to emit more reads than usual.

It should be possible to attach value-constraints to singleton types to address this.  But first we should figure out if this is the actual problem or not.

Hannes, what do you think?
Flags: needinfo?(hv1989)
Kannan, do you see a workaround we could apply in the short-term?

If the issue is really on those consts, inlining them manually would fix it?
Flags: needinfo?(kvijayan)
It would help, but there are a number of uses that cannot be inlined.

goog.math.Long.ZERO, ONE, NEG_ONE, MIN_VALUE, and MAX_VALUE are all instances of Long which probably don't recieve a singleton type (so TI can't constantize them on compile).

But it should be possible to test if at least some of the perf can be won back by manually inlining the constants that can be (namely: TWO_PWR_***_DBL_ fields), and seeing if it helps.

Note that escape analysis also plays a significant role here.  For example, multiply for two negative numbers is implemented as:  this.negate().multiply(other.negate()), and each of the negate calls creates a temporary Long object that doesn't survive past the frame.  I think v8 can eliminate these allocs, but we can't.  Work on escape analysis is underway, though.

It would require a bit of dedicated analysis to arrive conclusively at what's causing the loss of performance here.

How urgent is this issue, Harald?
Flags: needinfo?(kvijayan)
I just had a thought about an ad-hoc fix to the JS code to get it to perform better.

TI should be able to operate on these constants if they were turned into trivial functions returning the constants - the functions get assigned singleton types and the calls inlined, and the inlined call ends up desugaring back to a constant.

I just verified that this works on a small test case.  I'll convert the Long code to this idiom and see if it causes any improvements.
Thanks a lot Kannan for your help. I'm looking forward to see how that transforms the library. And I always grave for more insights into JITting as you know ;)
Ok just tried that.  I don't think it's the constantification that's causing problems for us.  I tried this on my laptop just to get initial numbers, and we actually do better on average than Chrome does.  Also, the difference between the modified and non-modified code is dwarfed by the noise.

The next suspect is garbage collection.  If we're allocating a lot of Longs, and V8 and eliminate a lot of the allocations via escape analysis, we may be leaking short-lived items from the nursery into the tenured space, and forcing more regular full GCs.

Looking at the code, there are temporary Long instances allocated all over the place.

It fits with the symtoms on desktop since we expect desktop machines to face a lot less memory pressure in response to allocation than an FxOS device.  Will try to verify whether full GC is hitting a lot on FxOs with this test case.
Kannan, any other angle to attack this? I guess we will just explore more libraries for 64-bit Long operations.

Adding Mike as well for a juicy challenge: Who might have the most expertise to help with high performance crypto algebra?
Flags: needinfo?(mlee)
Flags: needinfo?(kvijayan)
(In reply to Harald Kirschner :digitarald from comment #7)
> ...
> Adding Mike as well for a juicy challenge: Who might have the most expertise
> to help with high performance crypto algebra?

No commitment but cc'ing Dave Huseby who might have relevant experience.
Flags: needinfo?(mlee)
Keywords: perf
Priority: -- → P3
Whiteboard: [c= p= s= u=]
Do we also have number from an iPhone or Android phone?
I hosted the test https://github.com/digitarald/webogram-issue-40 at http://goo.gl/x2ZU2T, so it can be opened by a browser. I borrowed HTC Butterfly from :pwang as I don't have an Android or iPhone, here's what I got:

Flame (browser app, base image v122)
  8.052, 17.033, 20.678
HTC Butterfly (android builtin browser)
  6.829, 12.917, 17.741
Igor, the author of Webogram that hits this issue, did some more benchmarks comparing other libraries: https://github.com/zhukov/prime-factorization-benchmark

He cut the time down to 1-2 minutes on FxOS 1.1 using Big Integer library that causes less GC by design.
ting, just for reference, both test cases are also available as github pages: http://zhukov.github.io/prime-factorization-benchmark/ and http://digitarald.github.io/webogram-issue-40/ .
Oh, thank you :)
The profile doesn't seem to show any major JS issues for me.  About 63% of the time is spent in epoll_wait, presumably waiting for events, and another 26% of the time is spent inside PaintFrame.

Is there something I'm missing?
Flags: needinfo?(kvijayan)
The %63 epoll_wait is of b2g process event loop, if you click a bar of "GeckoMain (Browser)", you will switch to browser process and see 99.8% in pqPrimeLong().

The free memory was around 500MB while capturing.
Flags: needinfo?(kvijayan)
Good call.  Inverting the callstack on this shows GC taking up about 18% of the time for that region.  As I suspected earlier, the temporary allocations that goog.math.Long is doing for various operations (primarily pqPrimeLong, secondarily gcdLong) is causing us to GC a lot.

The best way of eliminating this is escape analysis.  The code is simple enough and small enough that a straightforward ad-hoc analysis should yield good results here.  Nicolas has been working on this for a while, I'm adding him to the needinfo.

Nicolas: can you use this as a test case and work towards landing our current ad-hoc escape analysis so we can see some real-world benefits out of the Recover-Instructions + Scalar-Replacement work that's been happening.
Flags: needinfo?(kvijayan) → needinfo?(nicolas.b.pierron)
(In reply to Kannan Vijayan [:djvj] from comment #18)
> The best way of eliminating this is escape analysis.  The code is simple
> enough and small enough that a straightforward ad-hoc analysis should yield
> good results here.  Nicolas has been working on this for a while, I'm adding
> him to the needinfo.
> 
> Nicolas: can you use this as a test case and work towards landing our
> current ad-hoc escape analysis so we can see some real-world benefits out of
> the Recover-Instructions + Scalar-Replacement work that's been happening.

Indeed, escape analysis should help for "b.and(…)" and "a.and(…)" in the conditions of the "gcdLong" function.  But I am afraid that the current implementation cannot go further than that as it only looks for non-escaped objects.  We could improve that with a DCE based approach, and reduce the number of allocations per-loop (shiftRight within gcdLong innermost while loops).

On the other hand, I bet GGC (see Bug 1020751) would also help a lot at reducing full GCs, and do more nursery collections.  In which case escape analysis will just be a way to prevent filling up the nursery too often.

I will keep the ni? and report the score on non-ggc and ggc flames next week.
(In reply to Nicolas B. Pierron [:nbp] from comment #19)
> I will keep the ni? and report the score on non-ggc and ggc flames next week.

Flame (browser app, inbound, GGC)
  3.201, 4.14, 7.057, 8.148, 8.386, 8.862, 9.424, 10.465, 10.644, 12.921, 25.527, 32.322, 34.047

Flame (browser app, inbound, no GGC)
  5.545, 8.684, 9.667, 11.378, 14.332, 14.604, 15.1, 15.397, 19.176, 19.565, 21.378, 30.07, 31.606


As the variance is strangely high I sorted the results to the different buckets of results. I highly recommend doing more than 3 runs.  This variance is likely caused by the scheduling of GCs.  So GGC indeed seems to be improving the overall score, even if they are some slow runs too.
Flags: needinfo?(nicolas.b.pierron)
No longer blocks: 1064358
Depends on: 1064358
(In reply to Kannan Vijayan [:djvj] from comment #18)
> Nicolas: can you use this as a test case and work towards landing our
> current ad-hoc escape analysis so we can see some real-world benefits out of
> the Recover-Instructions + Scalar-Replacement work that's been happening.

I made a shell version of the webpage and tested it on my laptop with 4 different configurations:
 1/ without GGC, without escape analysis.
 2/ Without GGC, with escape analysis
 3/ With GGC, without escape analysis
 4/ With GGC, with escape analysis

Each configuration was started 200 times (in a new shell each time) and the mean is taken are reported here:

1: 1.53 seconds
2: 0.94 seconds, 38.2% better than (1)
3: 0.40 seconds, 73.7% better than (1)
4: 0.37 seconds,  8.2% better than (3)

Note that escape analysis still matters even after GGC as we are not filling the nursery as frequently and we are doing less writes to memory.

The above escape analysis is based on the patches from Bug 1064358.
That's awesome!  Looks like GGC + Escape Analysis has a lot to offer.
Depends on: 1069307
(In reply to Nicolas B. Pierron [:nbp] from comment #21)
>  1/ without GGC, without escape analysis.
>  4/ With GGC, with escape analysis
> 
> 1: 1.53 seconds
> 4: 0.37 seconds

GGC and Escape analysis are both on by default on all platforms.
I do not think it would be safe to backport any of these optimizations to older branches.

Thus, I think we can close this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(hv1989)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.