Bug 1152893 (Open) - Opened 9 years ago - Updated 2 years ago

[meta] Investigate remaining performance regressions in promises due to async stacks

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect

Tracking


Tracking Status
firefox39 + fixed
firefox40 + fixed

People

(Reporter: bzbarsky, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, perf, regression)

[Tracking Requested - why for this release]:

Promises currently capture stacks in all sorts of situations.  When we added that, we decided stack capture was cheap enough to be OK.

But now with async stacks that no longer seems to be true.  In particular, on the testcase in bug 1152875 I see stack capture taking up 40% of the runtime, and it's completely dominated by SavedStacks::adoptAsyncStack.  Specifically, of the 1508ms spent under Promise::CaptureStack in my profile, 1270ms is spent under adoptAsyncStack.

We need to either make this faster somehow or disable promise stack captures by default.

STEPS TO REPRODUCE:

1)  git clone https://github.com/wanderview/streams-promise-read
2)  Edit streams-promise-read/test.js and change the argument to
    executePromise() from 10 to 10000.
3)  Run streams-promise-read/index.html and note the promise case ops/s.
4)  Disable the adoptAsyncStack call in SavedStacks::insertFrames.
5)  Repeat step 3.

I'm observing a quite noticeable speedup in step 5 compared to step 3...
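
For reference, the hot path boils down to chaining a large number of promise reactions in a tight loop, so every iteration triggers a stack capture (and, with async stacks enabled, an adoption of the async parent). A rough stand-alone approximation of that loop (the function below is hypothetical, not the actual test.js code):

  function executePromiseApprox(iterations) {
    const start = performance.now();
    let chain = Promise.resolve(0);
    // Every then() allocates a new promise, so this loop produces thousands
    // of stack captures on a build with promise stack capture enabled.
    for (let i = 0; i < iterations; i++) {
      chain = chain.then(value => value + 1);
    }
    return chain.then(() => {
      const elapsed = (performance.now() - start) / 1000;
      console.log((iterations / elapsed).toFixed(0) + " ops/s");
    });
  }

  executePromiseApprox(10000);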
Note: you don't have to edit the test to change the chunk count anymore.  You can just browse to:

  blog.wanderview.com/streams-promise-read/index.html?chunks=10000

Hope that helps.
(1) Since all the JSAPI methods for working with SavedFrames handle wrappers, could we relax the requirement that the async parent be a SavedFrame in the async child's compartment?

(2) What if async parent were a weak reference?

If we had both (1) and (2), then we could just set the async parent pointer directly, and avoid copying the N youngest async frames into the async child's compartment.

My hope is that (2) wouldn't result in too many missing async stacks, but that isn't clear to me.
Blocks: 1152875
> could we relax the requirement that the async parent be a SavedFrame in the async
> child's compartment?

Note that in this case everything is same-compartment.
(In reply to Not doing reviews right now from comment #3)
> > could we relax the requirement that the async parent be a SavedFrame in the async
> > child's compartment?
> 
> Note that in this case everything is same-compartment.

Sure, but that isn't always the case.

The two things adoptAsyncStack does are (1) ensure async stacks are in the child's compartment and (2) enforce the frame limit so we don't keep too many SavedFrame objects alive. We need a story for both of those things if we want to avoid this call.
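
For anyone following along, here is a rough conceptual model of those two jobs, with SavedFrame objects modeled as plain JS objects. This is only an illustration of the idea, not SpiderMonkey's actual implementation, and the constant name is made up:

  // Adoption, conceptually: walk the async stack from youngest to oldest,
  // copy at most MAX_ASYNC_FRAMES frames into the child's compartment, and
  // relink the copies so each frame points at the next older one.
  const MAX_ASYNC_FRAMES = 60;

  function adoptAsyncStackModel(asyncStack, childCompartment) {
    const copies = [];
    for (let frame = asyncStack;
         frame !== null && copies.length < MAX_ASYNC_FRAMES;
         frame = frame.parent) {
      copies.push({ functionName: frame.functionName,
                    compartment: childCompartment,
                    parent: null });
    }
    for (let i = 0; i + 1 < copies.length; i++) {
      copies[i].parent = copies[i + 1];
    }
    return copies.length ? copies[0] : null;  // youngest adopted frame
  }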
(In reply to Nick Fitzgerald [:fitzgen] from comment #4)
> The two things adoptAsyncStack does is (1) ensure async stacks are in the
> child's compartment

For this, we could have a fast path when the number of frames is below the limit and the compartment is already the same, but this isn't going to help with the test case in question.

> (2) enforce the frame limit so we don't keep too
> many SavedFrame objects alive.

That's the main thing the function does in the streams case, and I don't think we can avoid it.

There might be some extra allocation time that could be avoided by increasing the vector size ("typedef Vector<Lookup, 60> LookupVector;"), but I guess most of the time is spent in the hash lookup phase, finding the same pointer to the same chain of 60 frames, right?

We could try reducing the limit to 20 frames or fewer; this should reduce the lookup time linearly.

I believe there could be a better algorithm for handling stable recursion: something that, given 4 real frames and an async stack reference, could immediately recognize that we're recursing in the same way we did before and return the right truncated async stack without having to look up the hash of the other 56 frames. I couldn't come up with something off the top of my head, but maybe Jim has thoughts on this?
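
One possible starting point, just to make the shape of the idea concrete, would be to memoize the last adoption and compare only the youngest real frames against it. Everything below is invented for illustration; whether these cheap equality checks would actually hit in the stable-recursion case is exactly the open question:

  // Hypothetical fast path: if the new capture has the same async parent and
  // the same youngest real frames as the previous one, hand back the
  // previously built truncated chain instead of re-hashing all 60 frames.
  let lastAdoption = null;

  function maybeReuseTruncatedChain(asyncStack, youngestFrames, buildChain) {
    if (lastAdoption &&
        lastAdoption.asyncStack === asyncStack &&
        lastAdoption.youngestFrames.length === youngestFrames.length &&
        lastAdoption.youngestFrames.every((f, i) => f === youngestFrames[i])) {
      return lastAdoption.truncatedChain;
    }
    const truncatedChain = buildChain();  // slow path, as today
    lastAdoption = { asyncStack, youngestFrames, truncatedChain };
    return truncatedChain;
  }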
Perhaps implementing bug 1028418 would help here? We would still pay the adoptAsyncStack cost for the very first async stack capture, but after that it would be cached.

I'm not a fan of reducing the stack depth to 20 -- it doesn't seem like enough frames, especially when you consider that this is 20 frames *captured*, not 20 frames shown to the user. Self-hosted frames and chrome frames get filtered out of the display, so this is potentially many fewer frames shown.
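
I don't know exactly what bug 1028418 proposes, but the general shape of such a cache might look something like this (purely illustrative; in reality it would presumably need to be keyed per compartment and interact with the frame limit):

  // Remember the adopted copy for each async-parent stack, so only the first
  // capture with a given async parent pays the adoptAsyncStack cost.
  const adoptedCache = new WeakMap();

  function adoptWithCache(asyncStack, adoptAsyncStack) {
    let adopted = adoptedCache.get(asyncStack);
    if (adopted === undefined) {
      adopted = adoptAsyncStack(asyncStack);  // slow path, first time only
      adoptedCache.set(asyncStack, adopted);
    }
    return adopted;
  }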
> Perhaps implementing bug 1028418 would help here?

Could be!

I'm happy to measure if patches appear.
Nick, is this something you are planning to work on, or do you think we'll disable this in beta 39? How important is it that this goes into the 39 release, and what features does it support? Or do we expect it to stay enabled only on Dev Edition/Aurora?
Flags: needinfo?(nfitzgerald)
Paolo, what would be the impact to your work on our testing infra if we disabled this? See comment 9. I think you are in a better place to answer these questions.
Flags: needinfo?(nfitzgerald) → needinfo?(paolo.mozmail)
(In reply to Nick Fitzgerald [:fitzgen] from comment #10)
> Paolo, what would be the impact to your work on our testing infra if we
> disabled this?

To answer this question: the main advantage of async stacks in our testing infrastructure is better identification of failure causes, which only requires the feature to stay enabled in Nightly, at least during testing. They may also be quite useful as a feature enabled by default in Developer Edition.

I don't see a strong technical difficulty in not letting this ride Beta and Release, except that some affected tests would still need to handle both the enabled and disabled cases, and we'd see the difference in behavior only at merge time.

That said, let's not forget we're looking at a synthetic benchmark. These are important, and sometimes developers look at them to make decisions, but this benchmark seems to assume you're using Promises for a computationally intensive loop, which I don't think is a good use of Promises anyway. (You're not really testing real-world streams if you're working in 1-byte chunks... a better test would use chunks of at least 512 bytes, if not larger, and then you could compare the Promise overhead against network transfer or local I/O.)

We're talking about trading away a very useful feature for a (questionable) performance gain.

I'm not familiar with the pros and cons of designing for third-party synthetic benchmarks, so I may not be the best person to make the final decision. But maybe this is one of the cases where Beta would really give us some real world impressions and test cases to see the impact of async stacks on the web? Looks like a good service to users and developers who opted in to receive Beta updates.

We can then better decide if we want this on Release.
Flags: needinfo?(paolo.mozmail)
I'll let Ben answer how realistic the benchmark is, but we saw similar performance with larger chunks too, and at least some of the streams stuff involves the stream being entirely in memory...  I agree that we should try to come up with a more realistic testcase for the streams thing specifically, though.

Really, the streams stuff is deciding whether it can use promises at all, on performance grounds; it would be nice if the answer didn't turn out to be "no".
(In reply to Not doing reviews right now from comment #12)
> I'll let Ben answer how realistic the benchmark is, but we saw similar
> performance with larger chunks too, and at least some of the streams stuff
> involves the stream being entirely in memory...  I agree that we should try
> to come up with a more realistic testcase for the streams thing
> specifically, though.

Making a system level test with I/O is on my todo list.

I think it's a bit unfair to say "Promises should only be used with blocking I/O".  We use async interfaces to avoid CPU jank all the time.

In terms of Streams, I agree that they will often be used with I/O.  But keep in mind they are designed to work for passing message objects, not just byte buffers.  If you have a transform stream that parses a 64 KB buffer into 1000 message objects, the downstream consumer should be able to access those 1000 message objects with minimal overhead.  That parse step is conceptually a "no I/O", CPU-bound task.
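
To make that scenario concrete, a hypothetical sketch (none of these names come from the Streams spec or from the actual test):

  // A CPU-bound parse step: one 64 KB buffer becomes ~1000 small message
  // objects, with no I/O anywhere in this step.
  function parseMessages(buffer) {
    const messages = [];
    for (let offset = 0; offset < buffer.byteLength; offset += 64) {
      messages.push({ offset, bytes: buffer.slice(offset, offset + 64) });
    }
    return messages;
  }

  // The consumer still pulls each buffer through a promise-returning read(),
  // so promise overhead is paid even though the work is pure computation.
  function consume(read, onMessage) {
    return read().then(buffer => {
      if (buffer === null) {
        return;  // end of stream
      }
      parseMessages(buffer).forEach(onMessage);
      return consume(read, onMessage);
    });
  }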

> Really, the streams stuff is deciding whether it can use promises at all, on
> performance grounds; it would be nice if the answer didn't turn out to be
> "no".

I agree.  We should do our best to minimize the performance penalty of using Promises over traditional async callbacks.

Do we do this async stack capture for callback functions?
(In reply to Ben Kelly [:bkelly] from comment #13)
> Do we do this async stack capture for callback functions?

Not yet, but we (devtools at least) would really like to.

So we need a long term solution here.

(In reply to :Paolo Amadini from comment #11)

So does Nightly-only sound like a good compromise to you, until we figure out the long term solution?
(In reply to Nick Fitzgerald [:fitzgen] from comment #14)
> So does Nightly-only sound like a good compromise to you, until we figure
> out the long term solution?

My take is that this feature should ride at least EARLY_BETA_OR_EARLIER.

(In reply to Not doing reviews right now from comment #12)
> Really, the streams stuff is deciding whether it can use promises at all, on
> performance grounds; it would be nice if the answer didn't turn out to be
> "no".

What is the alternative to using Promises? Any asynchronous operation or callback (setTimeout, setImmediate, and so on) is going to have async stacks, so it should not make a difference.

In fact, I would say we should go ahead and implement bug 1142577 (assuming it includes setImmediate). This would allow us to see the full impact of async stacks in real-world scenarios, without a bias towards non-native Promise libraries, and with the full scope of what we want to do.

Going to release with async stacks enabled by default and accessible to web content can be a differentiator. I wouldn't want to see us stopping at the first benchmark... because we know that no matter what we do this feature will have a performance (CPU, memory) regression in some scenarios.
(In reply to :Paolo Amadini from comment #15)
> What is the alternative to using Promises? Any asynchronous operation or
> callback (setTimeout, setImmediate, and so on) is going to have async
> stacks, so it should not make a difference.

Trying to decide between returning a promise on every read() and a synchronous read() with a "ready" promise.

It's currently spec'd to return a promise on every read().

So the alternative still uses promises, just potentially a lot fewer of them.
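
For context, the two shapes being compared look roughly like this (sketched from the description above; the reader method and property names are illustrative, not the spec's):

  // Shape 1 (as currently spec'd): every read() returns a promise, so
  // draining N chunks creates N promises and N stack captures.
  function drainPromisePerRead(reader, process) {
    return reader.read().then(({ value, done }) => {
      if (done) {
        return;
      }
      process(value);
      return drainPromisePerRead(reader, process);
    });
  }

  // Shape 2 (the alternative): read synchronously while data is buffered and
  // only wait on a "ready" promise when the buffer runs dry, so far fewer
  // promises are created overall.
  function drainWithReadyPromise(reader, process) {
    let chunk;
    while ((chunk = reader.readSync()) !== null) {
      process(chunk);
    }
    if (reader.done) {
      return Promise.resolve();
    }
    return reader.ready.then(() => drainWithReadyPromise(reader, process));
  }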

> Going to release with async stacks enabled by default and accessible to web
> content can be a differentiator. I wouldn't want to see us stopping at the
> first benchmark... because we know that no matter what we do this feature
> will have a performance (CPU, memory) regression in some scenarios.

I can totally see how it's a differentiator for web developers, but does it help end users who don't know anything about JS?

It seems like this would be most useful in Dev Edition.  Could we add a devtools option to enable it in beta/release?
To clarify: I think having async stacks enabled by default in Release would be a differentiator making Firefox look better for web developers, because not all web developers use the Developer Edition or know about all the features that can be enabled in devtools. Async stacks accessible to web content would work even with dump() and "new Error().stack" style debugging.

We don't know of any real-world downsides to doing that - until proven wrong. This may well happen as soon as this reaches Beta, or maybe not. (But we'd need bug 1142577 to ride on the same train, which may well mean we delay async stacks by one cycle to get more testing for it on Developer Edition where we may get early signals if this isn't the right path.)

(In reply to Ben Kelly [:bkelly] from comment #16)
> It seems like this would be most useful in Dev Edition.  Could we add a
> devtools option to enable in beta/release?

This is what Chrome does. Differentiator lost.
(By the way, I'd like to see all browsers having async stacks by default on Release soon...)
(In reply to :Paolo Amadini from comment #18)
> (By the way, I'd like to see all browsers having async stacks by default on
> Release soon...)

Are you planning to enable this on our mobile platforms?  The streams micro-benchmark ran significantly slower on an N5 in Fennec and even slower on a B2G Flame.  Have you done any performance measurements on these platforms?
(In reply to Ben Kelly [:bkelly] from comment #19)
> Are you planning to enable this on our mobile platforms?  The streams
> micro-benchmark ran significantly slower on a N5 in fennec and even slower
> on a b2g flame.

That's a good question - I realized too late I forgot to specify "Desktop browsers". There are definitely different memory as well as CPU constraints on mobile.

I was about to make a second aside but didn't want to spam the thread too much :-)

> Have you done any performance measurements on these platforms?

I assume Boris profiled this bug on Desktop.
> I assume Boris profiled this bug on Desktop.

That's correct.
I guess I'm less concerned if:

1) async stacks will apply to all async interfaces and not just promises
2) we'll be cautious about enabling on mobile devices

Somewhat unrelated, but do we collapse repeated frames in the async stacks?  Things like the Streams API that do async loops will have a lot of repeated frames.  Would this help performance at all?
No longer blocks: 1083359
Depends on: 1083359
Depends on: 1158133
Blocks: async-stacks
(In reply to Ben Kelly [:bkelly] from comment #22)
> 1) async stacks will apply to all async interfaces and not just promises

There are various bugs on file for this as dependencies of bug 981514; I think setImmediate and setTimeout in bug 1142577 would be the minimum we need to assess performance.

> 2) we'll be cautious about enabling on mobile devices

Filed bug 1158133 to disable by default on mobile for the moment.

> Somewhat unrelated, but do we collapse repeated frames in the async stacks? 
> Things like the Streams API that do async loops will have a lot of repeated
> frames.  Would this help performance at all?

The stack chains have shared tails, so normally adding a few new frames is fast, whether they are repeated or not. The performance issue occurs when we need to truncate a chain at a certain depth: assuming the new truncated chain looks exactly like the old one (stable recursion), we fully reuse the memory, but performing this optimization is expensive.

We may consider doing this less frequently, for example truncating to 60 frames only once we are over 90 frames. If we don't take other measures, though, we'd have a variable maximum stack depth. Or we could implement a more sophisticated folding of recursion.
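
A conceptual model of the shared-tail behavior described above, with plain objects standing in for SavedFrames (illustration only, not the SavedStacks code):

  // Appending a frame reuses the whole existing chain as a shared tail, so it
  // is O(1); only truncation at the depth limit has to rebuild (and, in the
  // real engine, re-intern) part of the chain.
  function pushFrame(functionName, parent) {
    return { functionName, parent };
  }

  function truncate(frame, maxDepth) {
    // Keep only the youngest maxDepth frames.
    if (maxDepth === 0 || frame === null) {
      return null;
    }
    return { functionName: frame.functionName,
             parent: truncate(frame.parent, maxDepth - 1) };
  }

  let stack = null;
  ["outer", "middle", "inner"].forEach(name => { stack = pushFrame(name, stack); });
  const capped = truncate(stack, 2);  // keeps "inner" and "middle" only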
Blocks: 1148593
One thing that would be handy is to provide devtools with a way to enable async stacks even if they default to disabled in the platform. That way, users who are debugging or looking at the performance tool could decide to accept the cost of the feature in return for the (much) better debugging it provides.
No longer blocks: 1148593
(In reply to Tom Tromey :tromey from comment #24)
> One thing that would be handy is to provide devtools with a way to enable
> async stacks even if
> they default to disabled in the platform.  That way, users who are debugging
> or looking at the
> performance tool could decide to accept the cost of the feature in return
> for the (much) better
> debugging it provides.

Yes, I would like to sprinkle this feature more thoroughly throughout DevTools code, and during my own browser development I would much rather just have a working stack, even if I have to pay for it with some speed.  A pref or something to control this would be nice.
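
For local development, flipping it on could be as simple as a single pref; I believe the pref below is the one the platform uses, but it's worth double-checking against bug 1158133 before relying on it:

  // user.js (or about:config): force async stack capture on, regardless of
  // the channel default.
  user_pref("javascript.options.asyncstack", true);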
From bug 1158133, it looks like we are about to turn off async stacks for the beta and release channels on desktop, and for all mobile platforms.
Ben, do you think you can compare a benchmark on Aurora with one on Nightly with bug 1177508 applied?
Flags: needinfo?(bkelly)
You can run the tests by visiting:

  https://blog.wanderview.com/streams-promise-read/

And clicking on the "Async Read (Native Promises)" link.
Flags: needinfo?(bkelly)
Most of the original performance issues marked as dependencies of this bug have been fixed, but there are still a few optimizations we can make. I'm not aware that any team is working on these specifically at the moment, though.
Keywords: meta
Summary: Significant performance regressions in promises due to async stacks → Investigate remaining performance regressions in promises due to async stacks
Summary: Investigate remaining performance regressions in promises due to async stacks → [meta] Investigate remaining performance regressions in promises due to async stacks
Severity: normal → S3