Closed Bug 1066701 Opened 10 years ago Closed 9 years ago

2% osx 10.6 session restore regression on inbound (v.34) August 19th from bug 1050340

Categories

(Core :: XPConnect, defect)

All
macOS
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

when bug 1050340 landed, we had a bump in osx 10.6 session restore numbers.  This is where we went from a range of 1980.0->1995.0 and now have a range of 1998.0->2020.0.

There have been many regressions and this is one of 3 which have caused a much more noticeable regression when we uplifted to Aurora.
I have done some retriggers to point out the issue:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&startdate=2014-08-18&enddate=2014-08-20&jobname=Rev4%20MacOSX%20Snow%20Leopard%2010.6%20mozilla-inbound%20talos%20other_nol64

and here is a graph to show the regression:
http://graphs.mozilla.org/graph.html#tests=[[313,52,21],[313,63,21],[313,64,21]]&sel=1407873538930,1410465538930&displayrange=30&datatype=running

:bholley, can you look at the set of patches and help figure out what is causing this regression?  

Ff you need more information about the test here is some documentation: https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore, and :Yoric would know any additional details.
Flags: needinfo?(bobbyholley)
Hey Yoric,

I'm swamped with e10s stuff, and am concerned about my ability to do a full investigation here a timely manner.

The patch I landed definitely would make structured cloning of cross-compartment objects slower. We generally shouldn't be doing this though, and should design the code such that any clones of large objects are initiated from the compartment in which that code lives. Do we do that kind of thing in Session Restore?
Flags: needinfo?(dteller)
Afaict, the only two structured clonings we use are:
- Cu.cloneInto(foo, {})
- Worker.prototype.postMessage.

Would the change impact one of these?
Flags: needinfo?(dteller)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #4)
> Afaict, the only two structured clonings we use are:
> - Cu.cloneInto(foo, {})

This should be fine - any literal should be same-compartment. Though if you want to create an empty object, you should just do |new foo.Object()|.

> - Worker.prototype.postMessage.

This would be affected depending on whether anything is being postMessaged.

No indexedDB in session restore?
No indexedDB in Session Restore itself, but there could be some indexedDB calls measured accidentally by the benchmark. I realize that `postMessage` is only called later in Session Restore, so that's probably not impacted here.

> This should be fine - any literal should be same-compartment. Though if you want to create an empty object, you should just do |new foo.Object()|.

No, we're cloning large not-very-well-typed trees.
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #6)
> No, we're cloning large not-very-well-typed trees.

Ok. Is there a possibility of any cross-global edges in those trees? Can you link me to the code?
http://dxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#2501
and http://dxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#3095 – both of which are part of the somewhat hairy startup of Session Restore.

Note, however, that the benchmark captures everything that happens between the start and the end of Session Restoration, which may at times include quite unrelated stuff.
Bobby: would it be possible to run a profile of startup and find out where structured cloning is used?
Yes, it should be. I'm sorry I wasn't able to get to this before heading out on PTO. I'll try to make time when I get back, but it's getting a bit late so if you get the chance to take a look that would be appreciated.
In reply to Bobby Holley (:bholley) from comment #5)
> > - Cu.cloneInto(foo, {})
> This should be fine - any literal should be same-compartment. Though if you
> want to create an empty object, you should just do |new foo.Object()|.

Sorry, I was wrong, this is totally it. I mixed up the argument order, and thought that code was cloning |{}| into the scope of |foo|, but it's actually cloning |foo| into the scope of |{}|, which in this case crosses compartment boundaries.

I profiled cloning a bit - it's about 33% slower to perform over a cross-compartment wrapper. There are a few optimizations that could be made, but there's no satisfying low-hanging fruit. The overhead is all in JSStructuredClone::write:
* ~1 part hasOwnProperty-on-proxy (which it sounds like we could eliminate)
* ~2 parts getProperty-on-proxy
* ~1 part other stuff inside of the ensuing startWrite() call.

So we could fix this regression as it stands in either JS or C++. The JS solution would involve cloning the object into its own compartment instead of the caller's compartment (Cu.cloneInto(foo, foo)), since it's all System Principal anyway. The C++ solution would involve special-casing CrossCompartmentWrapper (which we know to be transparent).

Both of those are fine but neither makes me excited. The JS solution is black magic. The C++ solution complicates our invariants, which would be OK for a perf win on a known bottleneck, but it's not supposed to be a bottleneck (the cloning code in general isn't very well-optimized). It only becomes a bottleneck in the case where we're trying to clone 700k of sessionStore JSON.

If we could get rid of that clone call, we would presumably see a 6% win on this benchmark from the initial state (8% considering the regression in this bug). Yoric, do we _really_ need to be cloning 700k of JSON in those cloneInto calls, or can we be more clever?
Flags: needinfo?(bobbyholley) → needinfo?(dteller)
(In reply to Bobby Holley (:bholley) from comment #11)
> ...
> Both of those are fine but neither makes me excited. The JS solution is
> black magic. The C++ solution complicates our invariants, which would be OK
> for a perf win on a known bottleneck, but it's not supposed to be a
> bottleneck (the cloning code in general isn't very well-optimized). It only
> becomes a bottleneck in the case where we're trying to clone 700k of
> sessionStore JSON.
>

Considering this regression manifests apparently only on OS X 10.6, the fact that we've decided to ignore such 10.6-only regressions some time ago (bug 990490), the relatively low magnitude of the regression, and the quote above, IMO there are probably better ways to spend the time than to chase this specific regression.

 
> If we could get rid of that clone call, we would presumably see a 6% win on
> this benchmark from the initial state (8% considering the regression in this
> bug). Yoric, do we _really_ need to be cloning 700k of JSON in those
> cloneInto calls, or can we be more clever?

Would definitely be nice to get a win from ^, but IMO my previous paragraph still holds.

Bottom line - regressions are used to better understand the performance implications of patches. Once these implications are understood, fixing the regression or not is a judgment call by those who care most about this code, especially when the implications of the regression are relatively minor.
(In reply to Avi Halachmi (:avih) from comment #12)
> Considering this regression manifests apparently only on OS X 10.6, the fact
> that we've decided to ignore such 10.6-only regressions some time ago (bug
> 990490), the relatively low magnitude of the regression, and the quote
> above, IMO there are probably better ways to spend the time than to chase
> this specific regression.

This regression _does_ point us to a real issue, which is that sessionStore is cloning an enormous amount of JSON (enough to be take about 3ms on my top-of-the-line dev MBP). If we can find a way to avoid doing that clone, we should do it. If we can't, we might as well change Cu.cloneInto(foo, {}) to Cu.cloneInto(foo, foo), which is a very simple change and gives us a noticeable speedup.
Yeah, SessionStore is a, how shall we put it, large algorithm that was initially written without an data API. Since everything was done through undocumented side-effects on an undocumented data structure, its only protection against accidents was to `JSON.parse(JSON.stringify(...))` some or all of its data in some cases – in case you wonder, that's sometimes several Mb of stringifying/parsing.

While we have cleaned up lots of parts of Session Restore, we have not been able to actually introduce a clean separation between data and code yet, and I have spent too much energy trying to do it without result. The only thing we did here was replace JSON.parse(JSON.stringify(...)) by a somewhat saner Cu.cloneInto.
Flags: needinfo?(dteller)
In other words, yes, we should certainly change `Cu.cloneInto(foo, {})` into `Cu.cloneInto(foo, foo)`.
(In reply to Bobby Holley (:bholley) from comment #13)
> This regression _does_ point us to a real issue, which is that sessionStore
> is cloning an enormous amount of JSON (enough to be take about 3ms on my
> top-of-the-line dev MBP). If we can find a way to avoid doing that clone, we
> should do it. If we can't, we might as well change Cu.cloneInto(foo, {}) to
> Cu.cloneInto(foo, foo), which is a very simple change and gives us a
> noticeable speedup.

Sure, that's a very good understanding and the suggestion is definitely worth trying, as I mentioned earlier. My comment was mostly about fixing the change which caused the regression - not about not touching anything.

The bottom line was that it's a judgment call more than an obligation to fix the regression.
I realized that Cu.cloneInto(foo, foo) still won't work - it'd need to be Cu.getGlobalForObject(foo).cloneInto(foo, foo), I think, but testing locally doesn't seem to give the performance characteristics I'd expect.

I think I'm not going to spend anymore time on this.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.