Open Bug 1154987 Opened 9 years ago Updated 6 months ago

[Meta] Move JS parsing off of main thread and enable incremental parsing

Categories

(Core :: JavaScript Engine, defect)

x86_64
Windows 8.1
defect

Tracking

()

Performance Impact low

People

(Reporter: djvj, Unassigned)

References

(Depends on 2 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta, perf, Whiteboard: [games:p3][platform-rel-Games])

Attachments

(1 file, 3 obsolete files)

Currently, we do all parsing of sync JS (i.e. most JS coming in from <script> tags) synchronously on the main thread.  This introduces jank on some sites (e.g. 50ms on gdocs), and generally slows things down because page execution gets blocked on parse which gets blocked on a full network read the JS code.

Luke and bz have discussed this for a bit, and it seems like it should be doable, and the gains would be good.

There are a few concerns to address:

1. We currently do off-main-thread parsing of async JS.  There are some high-overhead facets of this implementation.  For one, the off-main-thread parser modifies the atoms table directly, and thus necessitates a lock around the atoms table, which must be used by all parsers - both on main thread and on parse threads.  Also, an off-main-thread parse allocates memory into a special compartment (for shape and other allocations) and at the end of a background parse the arenas in that compartment get reparented into the main GC compartment.  This may be leading to high memory fragmentation caused by background parses.

An initial step is to measure the cost of gc-allocation in the parser, and eliminate it by having the parser keep its own special structures describing what objects need to be created, and only doing the object creation at 'merge' time when the parsed script is imported into the runtime.

Removing these allocations would reduce the overhead of any eventual shift of sync-js parsing to a background thread.

2. Synchronous JS parsing can then be moved to a background thread, but without any incremental parsing.  This would unblock the main thread from parsing, but without any of the benefits of parsing overlapping with network fetches.

3. Enable incremental parsing, which should improve parsing performance by interleaving it with network fetch.
Depends on: 1061886, 1084009
Note also some of the discussion in bug 1147048, both in terms of scriptloader bits and what JSAPI changes we might need.
Depends on: 1155754
I've done some measurements of the benefits of removing the atoms lock taken by the parser.  Since the shell execution of the octane test suite requires no background parsing whatsoever, I just disabled lock-taking in AtomizeChars and tested the improvement there.

I ran 50 octane runs with and without the change, and tabulated the results here:

               TEST   REF SCORE +- DEV       OPT SCORE +- DEV     PERCENT DIFF    SIGNIFICANT?
            Richards: 25332.12 +-  35.34     25246.52 +-  34.62   {diff = -0.34%} [!!!]
           DeltaBlue: 39556.17 +-  93.65     39613.12 +-  72.07   {diff =  0.14%} [   ]
              Crypto: 25450.60 +-  33.32     25506.48 +-  33.51   {diff =  0.22%} [   ]
            RayTrace: 88089.83 +-  88.81     88561.58 +- 102.40   {diff =  0.54%} [!!!]
         EarleyBoyer: 32429.67 +-  27.06     32399.56 +-  17.53   {diff = -0.09%} [   ]
              RegExp:  3052.25 +-   9.52      3025.23 +-   9.91   {diff = -0.89%} [!!!]
               Splay: 17316.48 +- 109.41     17425.56 +- 109.00   {diff =  0.63%} [   ]
        SplayLatency: 20629.88 +-  34.79     20532.62 +-  41.32   {diff = -0.47%} [!!!]
        NavierStokes: 33357.69 +-  20.38     33389.44 +-  15.95   {diff =  0.10%} [   ]
               PdfJS: 14734.15 +-  56.10     15248.25 +-  74.91   {diff =  3.49%} [!!!]
            Mandreel: 24921.58 +-  21.81     24928.94 +-  18.78   {diff =  0.03%} [   ]
     MandreelLatency: 26937.50 +- 251.60     27535.42 +- 316.73   {diff =  2.22%} [!!!]
             Gameboy: 52315.50 +- 225.32     51656.25 +- 274.46   {diff = -1.26%} [!!!]
            CodeLoad: 16603.58 +-  12.86     16626.54 +-  11.78   {diff =  0.14%} [   ]
               Box2D: 41214.67 +-  42.17     41214.92 +-  48.00   {diff =  0.00%} [   ]
                zlib: 71845.81 +- 108.25     72052.19 +-  87.97   {diff =  0.29%} [!!!]
          Typescript: 26737.04 +-  25.23     26807.31 +-  36.92   {diff =  0.26%} [!!!]
   Score (version 9): 26699.38 +-  16.39     26774.15 +-  17.62   {diff =  0.28%} [!!!]


The reference build (without the lock removal) scores are first, and the "optimized" build (with the lock removal) are second.  The results are averaged across all 50 (with the highest and lowest score removed for each), and the standard deviation calculated.  Tests where the reference score and optimized score differ by more than the standard deviation are flagged with [!!!].  This is a rough test for significance.

So, looking only at the significant differences:

            Richards: 25332.12 +-  35.34     25246.52 +-  34.62   {diff = -0.34%}
            RayTrace: 88089.83 +-  88.81     88561.58 +- 102.40   {diff =  0.54%}
              RegExp:  3052.25 +-   9.52      3025.23 +-   9.91   {diff = -0.89%}
        SplayLatency: 20629.88 +-  34.79     20532.62 +-  41.32   {diff = -0.47%}
               PdfJS: 14734.15 +-  56.10     15248.25 +-  74.91   {diff =  3.49%}
     MandreelLatency: 26937.50 +- 251.60     27535.42 +- 316.73   {diff =  2.22%}
             Gameboy: 52315.50 +- 225.32     51656.25 +- 274.46   {diff = -1.26%}
                zlib: 71845.81 +- 108.25     72052.19 +-  87.97   {diff =  0.29%}
          Typescript: 26737.04 +-  25.23     26807.31 +-  36.92   {diff =  0.26%}
   Score (version 9): 26699.38 +-  16.39     26774.15 +-  17.62   {diff =  0.28%}

Removing a lock should in no way lead to negative results, so the regressions we see are statistical aberrations.  Looking at the positive results, we see two significant rows:

               PdfJS: 14734.15 +-  56.10     15248.25 +-  74.91   {diff =  3.49%}
     MandreelLatency: 26937.50 +- 251.60     27535.42 +- 316.73   {diff =  2.22%}

Not sure if these are aberrations or truly significant, so I'm running another 50 iterations and will check if those same two show up with similar improvements in score.
Maybe try Parsemark as well? Not sure how relevant it is these days but I know njn has used it in the past to benchmark/optimize the parser...

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Running_Parsemark
Another run of 50 produced the following results:

            Richards: 25305.54 +-  35.03     25214.85 +-  32.03   {diff = -0.36%} [!!!]
           DeltaBlue: 39315.33 +- 105.15     39382.08 +- 112.93   {diff =  0.17%} [   ]
              Crypto: 25465.08 +-  36.99     25464.19 +-  34.36   {diff = -0.00%} [   ]
            RayTrace: 88200.83 +-  95.61     88501.46 +- 114.22   {diff =  0.34%} [!!!]
         EarleyBoyer: 32400.04 +-  25.20     32378.25 +-  20.14   {diff = -0.07%} [   ]
              RegExp:  2996.71 +-  11.77      2999.98 +-  10.53   {diff =  0.11%} [   ]
               Splay: 17741.19 +-  92.06     17637.17 +- 102.90   {diff = -0.59%} [   ]
        SplayLatency: 20815.21 +-  35.83     20552.83 +-  34.52   {diff = -1.26%} [!!!]
        NavierStokes: 33390.31 +-  19.09     33393.58 +-  14.88   {diff =  0.01%} [   ]
               PdfJS: 14658.54 +-  70.57     15346.38 +-  52.36   {diff =  4.69%} [!!!]
            Mandreel: 24952.29 +-  21.66     24922.56 +-  21.03   {diff = -0.12%} [   ]
     MandreelLatency: 27760.31 +- 375.29     26882.27 +- 207.32   {diff = -3.16%} [!!!]
             Gameboy: 51853.75 +- 301.96     52235.35 +- 170.22   {diff =  0.74%} [   ]
            CodeLoad: 16578.81 +-  10.02     16610.56 +-  11.96   {diff =  0.19%} [!!!]
               Box2D: 41147.65 +-  53.69     41214.00 +-  44.44   {diff =  0.16%} [   ]
                zlib: 72343.04 +-  84.16     71980.71 +-  93.42   {diff = -0.50%} [!!!]
          Typescript: 26784.12 +-  32.18     26891.40 +-  24.71   {diff =  0.40%} [!!!]
   Score (version 9): 26745.33 +-  14.36     26770.29 +-  14.35   {diff =  0.09%} [   ]

The significant improvement in PdfJS scores is repeated, but MandreelLatency goes to a significant regression.  So the Latency is just more noisy than expected across runs.

Unifying these with the previous 50, this suggests a consistent score change in the following tests:

            Richards: 25332.12 +-  35.34     25246.52 +-  34.62   {diff = -0.34%} [!!!]
            Richards: 25305.54 +-  35.03     25214.85 +-  32.03   {diff = -0.36%} [!!!]

            RayTrace: 88089.83 +-  88.81     88561.58 +- 102.40   {diff =  0.54%} [!!!]
            RayTrace: 88200.83 +-  95.61     88501.46 +- 114.22   {diff =  0.34%} [!!!]

        SplayLatency: 20629.88 +-  34.79     20532.62 +-  41.32   {diff = -0.47%} [!!!]
        SplayLatency: 20815.21 +-  35.83     20552.83 +-  34.52   {diff = -1.26%} [!!!]

               PdfJS: 14734.15 +-  56.10     15248.25 +-  74.91   {diff =  3.49%} [!!!]
               PdfJS: 14658.54 +-  70.57     15346.38 +-  52.36   {diff =  4.69%} [!!!]

          Typescript: 26737.04 +-  25.23     26807.31 +-  36.92   {diff =  0.26%} [!!!]
          Typescript: 26784.12 +-  32.18     26891.40 +-  24.71   {diff =  0.40%} [!!!]

I'll run a third time tonight to narrow this down a bit more, but it's seeming like overall, we may see a few minor improvements and regressions (not sure why removing a lock would cause a regression, but I'll go with the data).. and a significant improvement in PdfJS times.

A third run tonight will bolster or knock down these results (I need to run on an idle box and doing it now prevents me from using it for normal development).
(In reply to Jan de Mooij [:jandem] from comment #4)
> Maybe try Parsemark as well? Not sure how relevant it is these days but I
> know njn has used it in the past to benchmark/optimize the parser...
> 
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/
> Running_Parsemark

I just ran those tests.  There are no meaningful speedups by removing the atoms table lock.  I picked the three largest files (pipio, sunspider, gmail) and ran tests with 2000 iterations on them, and used R to test the differences in the means, and the 95% confidence interval for the two calculated means include zero.

Pure parsing should be seeing the most benefit from this, so I'm leaning towards this telling us that the locking done for async access to the atoms table is negligible.
The beginnings of a patch.  I added a new XPCOM interface called nsIIncrementalStreamLoader, mimicking nsIStreamLoader for now.  For now the behaviour is exactly the same, but nsScriptLoader has been modified to use the new classes instead of the old.

Next step is to change the behaviour of nsIIncrementalStreamLoader so it safely sends data to an incremental JS parser.
Attached patch add-script-load-handler.patch (obsolete) — Splinter Review
This refactors the "handler" for the nsIncrementalStreamLoader into its own class, called a nsScriptLoadHandler, which takes over the responsibilities of implementing nsIIncrementalStreamLoaderObserver from nsScriptLoader.

It also adds a new API method 'OnIncrementalData' to nsIncrementalStreamLoaderObserver.  This is just a dummy function for now (empty implementation), but will be used to accumulate data in the future.

Now, nsScriptLoadHandler can be modified to store state within itself to incrementally decode data coming in from the StreamLoader, feeding a parser incrementally, and only invoking nsScriptLoader after the script has been downloaded and parsed.
Assignee: nobody → kvijayan
Attached patch add-character-decoding.patch (obsolete) — Splinter Review
Incrementally convert incoming data into a char buffer on the nsScriptLoadHandler.  Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=41ee3011546b
Depends on: 1167409
No longer blocks: AppStartup
Keywords: meta, perf
Depends on: 1218029
Comment on attachment 8601688 [details] [diff] [review]
add-incremental-stream-loader.patch

(Rebased and moved to bug 1218029)
Attachment #8601688 - Attachment is obsolete: true
Comment on attachment 8601772 [details] [diff] [review]
add-script-load-handler.patch

(Rebased and moved to bug 1218029)
Attachment #8601772 - Attachment is obsolete: true
Comment on attachment 8603519 [details] [diff] [review]
add-character-decoding.patch

(Rebased and moved to bug 1218029)
Attachment #8603519 - Attachment is obsolete: true
Depends on: 1222086
That's a WIP patch that I'm using to measure benefits of incrementally sending loaded to the off-thread parser (depends on bug 1222086). You can disable this by setting javascript.incrementalParsing.enabled (bool) = false. (The 50ms gdocs delay/block did not go away yet -- investigating more)
Depends on: 1236104
(In reply to Yury Delendik (:yury) from comment #15)
> The 50ms gdocs delay/block did not go away yet -- investigating more.

The pre-fetched/pre-cached scripts currently not off-thread compiled. See bug 1236104.
Whiteboard: [games:p1] → [games:p1][platform-rel-Games]
platform-rel: --- → ?
Marking this as lower games priority, since we are looking towards WebAssembly which will avoid this.
Whiteboard: [games:p1][platform-rel-Games] → [games:p3][platform-rel-Games]
platform-rel: ? → ---
Blocks: 1347824
Whiteboard: [games:p3][platform-rel-Games] → [qf][games:p3][platform-rel-Games]
Whiteboard: [qf][games:p3][platform-rel-Games] → [qf:p1][games:p3][platform-rel-Games]
Whiteboard: [qf:p1][games:p3][platform-rel-Games] → [qf:p3][games:p3][platform-rel-Games]
Assignee: kvijayan → nobody
Status: ASSIGNED → NEW
Performance Impact: --- → P3
Whiteboard: [qf:p3][games:p3][platform-rel-Games] → [games:p3][platform-rel-Games]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: