Loading https://doc.servo.org/smallvec/trait.Array.html spends multiple seconds in parsing and GC during page load

Status: NEW (defect, severity normal, priority P2)
Assigned to: Unassigned
Reporter: emilio (NeedInfo)
Blocks: 1 bug; Keywords: perf
Version: unspecified; Points: ---
Firefox tracking flags: firefox68 fix-optional
Whiteboard: [qf:p1:pageload]
Opened: 3 months ago; last updated: 11 days ago

Comment 1 • emilio (Reporter) • 3 months ago

On a clean profile: http://bit.ly/2SloUct

Here is a profile I also collected which shows about 2s of parsing at the beginning: http://bit.ly/2Tk4RQs

Whiteboard: [qf] → [qf:p1:pageload]

Based on Jason's latest investigation:

This page loads a multi-megabyte script containing a single array with tons of literals and a few variables. The variables prevent us from using the fast path for JSOP_OBJECT creation, so we go through the slow path of emitting JSOP_INITELEM for each element.

var a = 1;
var b = 2;
var arr = [1,2,3,4,5,6,7,8,9,0, a ,1,2,3,4,5,6,7,8,9,0, b ,1,2,3,4,5,6,7,8,9,0];

Two potential approaches:

  • It sounds like this is the kind of value where the parser's constant propagation could be used to resolve the a and b variables.
  • Another option is to emit JSOP_OBJECT for the constant elements, and only emit JSOP_INITELEM for the few variable elements detected while parsing.
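A toy illustration (not the actual parser code) of what approach 1's constant propagation would do: if a and b are known to hold constants when the array literal is parsed, their uses can be substituted with those constants, leaving a fully-constant literal that qualifies for the JSOP_OBJECT fast path. The element representation and the foldConstants helper here are hypothetical.

```javascript
// Hypothetical sketch of parser-side constant propagation for array
// literals. Elements are either numbers (already constant) or strings
// naming a variable; `known` maps variable names to constant values.
function foldConstants(elements, known) {
  const folded = [];
  let allConstant = true;
  for (const el of elements) {
    if (typeof el === "number") {
      folded.push(el);              // already a literal
    } else if (known.has(el)) {
      folded.push(known.get(el));   // substitute the known constant
    } else {
      folded.push(el);              // unknown: stays a slow-path element
      allConstant = false;
    }
  }
  return { folded, allConstant };   // allConstant => JSOP_OBJECT fast path
}

// `var a = 1; var b = 2;` would make a and b known constants:
const known = new Map([["a", 1], ["b", 2]]);
const { folded, allConstant } = foldConstants([1, 2, "a", 3, "b"], known);
// folded is [1, 2, 1, 3, 2] and allConstant is true
```

As comment 3 below points out, this is only sound if the parser can prove the binding really is a fresh constant (e.g. not shadowing a pre-existing read-only global).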

Jason, does either of these sound like something easy to do?

Flags: needinfo?(jorendorff)
Priority: -- → P2

I don't see a quick hit here. Naive constant propagation would be unsound: if there's already a read-only global property a, then var a = 1; does not change its value.

Flags: needinfo?(jorendorff)

Well, the first step is to avoid GC. So:

  1. Is the frontend triggering GC by making a lot of garbage? (I sort of doubt it, but if so, something like approach 2 suggested in comment 3 is the only way.)

  2. If not, is there any realistic way to avoid GC?

This style of code rapidly consumes a huge amount of memory but generates no garbage at run time (that is, all of the arrays are ultimately stored in a global variable, and thus rooted). But it would be unsound to assume that the code doesn't generate garbage.
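A minimal sketch of the allocation pattern being described (the names are illustrative, not the actual page's code): every array is stored straight into a rooted global structure, so a minor GC finds everything live and has to tenure it all.

```javascript
// Illustrative only: mimics a generated data file where every array
// literal is immediately reachable from one global root. Nothing here
// ever becomes garbage, so collecting the nursery just tenures everything.
const implementors = {};                         // global root (hypothetical name)
for (let i = 0; i < 1000; i++) {
  implementors["impl" + i] = [i, i + 1, i + 2];  // rooted immediately on creation
}
```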

jonco, any ideas?

Flags: needinfo?(jcoppeard)

The profile in comment 1 contains an 18-second event processing delay, of which:

  • 4.0sec or so in script parsing
  • 4.5sec or so in script emit
  • 5.5sec in GC
  • 0.2sec other execution

So if we can eliminate the long GC pause, the next idea is to move script parsing and emitting to a thread. This probably wouldn't reduce total time from start to DOMContentLoaded, but I think it would keep the page responsive while we're waiting.

This is happening while executing JS, so I don't think it's related to bug 1543806.

We're allocating all this data into the nursery and then tenuring it when we collect the nursery. It would be great if we could detect this situation and allocate this data in the tenured heap to start with. Pre-tenuring is based on the ObjectGroup, and I don't know how that works for literal objects like the ones we have here. Do they all end up with separate groups? If so, I don't see how to avoid this.

Looking at the profiles, we're seeing huge minor GC times. But digging in I see things like taking 40ms to allocate or clear a page of memory, which makes me think that maybe the system is swapping at this point? I don't understand why this is taking so long.

No longer depends on: 1543806

#jsapi discussion indicates the spec for <script defer> allows off-main-thread parsing.


<jorendorff> bz: ping - I have a profile in bug 1530212 that raises
    questions about <script defer>

    bz: in particular, do we have to run the script synchronously as soon as
    parsing is complete?

    that is, what if we instead started js-parsing the script as soon as
    possible, then once html-parsing is complete and all deferred script
    js-parsing is complete, run them all and then fire DOMContentLoaded

    that seems indistinguishable from the JS scripts just taking a long time to
    load, from content's perspective

<bz> jorendorff: You are correct

    jorendorff: Bas and jesup had already run into us not yielding while
    executing defer scripts...

    jorendorff: And yes, we should be doing exactly what you suggest: kick off
    the parse once we have the data and there's nothing more urgent going on,
    and just run the scripts once both its parsing and the HTML parser is done

    ...

    https://html.spec.whatwg.org/multipage/parsing.html#the-end:list-of-scripts-that-will-execute-when-the-document-has-finished-parsing

    Spec explicitly says you don't wait until all defer scripts are ready to
    run the first one

    As long as you run them in order

    And you can do whatever in between; see step 3.1 in the steps I linked.

    jorendorff: To be clear, each individual script would run to completion

    jorendorff: We can just preempt between them

<smaug> we preempt if the next script hasn't been loaded

<jesup> Preempting if the script "ran too long" before running the next defer
    script seems totally reasonable.  SetTimeout has a similar mechanism to
    avoid monopolizing the event queue and blocking other things from happening

<smaug> yup

    and parsing/compiling more on background threads sounds good too, once we
    can do that while GC runs

<tcampbell> matt is making progress on that. It is looking reasonable so far

<smaug> great
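The scheduling idea from the discussion above — run deferred scripts in document order, each to completion, but yield back to the event loop between scripts once one has "run too long" — can be sketched roughly like this. The queue shape and time budget are assumptions for illustration, not Gecko's actual implementation.

```javascript
// Hypothetical sketch: drain a queue of compiled deferred scripts in
// order, preempting between (never within) scripts once a time budget
// is exhausted, so other events can run before the remaining scripts
// and DOMContentLoaded.
async function runDeferredScripts(queue, budgetMs = 16, now = Date.now) {
  let sliceStart = now();
  for (const script of queue) {
    script();                              // each script runs to completion
    if (now() - sliceStart > budgetMs) {   // "ran too long": yield between scripts
      await new Promise(resolve => setTimeout(resolve, 0));
      sliceStart = now();
    }
  }
}
```

Execution order is preserved because the function only awaits between loop iterations, matching the spec requirement that defer scripts run in order while allowing arbitrary work in between.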

(In reply to Jon Coppeard (:jonco) from comment #7)

> Looking at the profiles, we're seeing huge minor GC times. But digging in I see things like taking 40ms to allocate or clear a page of memory, which makes me think that maybe the system is swapping at this point? I don't understand why this is taking so long.

There's a lot I don't understand about these profiles.

In two of the three profiles, the samples don't come at regular intervals. (Unfortunately, the profiler UI still treats each sample as taking up just 1ms, which makes it hard to estimate how long things take.)

(In reply to Jason Orendorff [:jorendorff] from comment #9)

> In two of the three profiles, the samples don't come at regular intervals. (Unfortunately, the profiler UI still treats each sample as taking up just 1ms, which makes it hard to estimate how long things take.)

Oh thanks, that explains what I'm seeing. This also happens when I profile locally. It looks like I get one sample every ~6ms, with reported total running/self time correspondingly reduced.

I still don't know why we're getting these long minor GCs. When I run this code in the shell I can see long minor GCs of ~15ms to evict a full 16MB nursery. When run in the browser I can see e.g. a 672ms minor GC (!!). As far as I can see in the profile, nothing else is running though.
