[HTML5][Patch] Make the HTML5 parser not regress tp4

RESOLVED FIXED

Status

()

Core
HTML: Parser
P1
normal
RESOLVED FIXED
9 years ago
8 years ago

People

(Reporter: hsivonen, Assigned: hsivonen)

Tracking

({perf})

Trunk
x86
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 4 obsolete attachments)

Currently, the trunk tp4 score is 499.41 on Linux, 652.76 on WinNT 6.0, 378.67 on WinNT 5.1 and 652.39 on Mac.

The scores for the HTML5 parser are 520.22 on Linux and 463.66 on WinNT 5.1. So about 4% worse on Linux and 22% worse on WinNT 5.1.

This needs to be fixed.
Priority: -- → P1
Looks like the XP numbers fluctuate. Numbers for runs that have more constrained changes, so the differences should really be the parser or randomness:

      Old    HTML5  Change
Linux 499.46 520.22 + 4%
Mac   631.97 736.44 +17%
XP    451.29 463.66 + 3%
Vista 770.96 677.52 -12%

So Linux is 4% slower and XP is 3% slower, which looks like a gap that can be closed. Oddly, the HTML5 parser makes things 12% faster on Vista and 17% slower on Mac. The figure for Mac looks troubling.
Depends on: 506090
So the Vista numbers were too good to be true. Here an new numbers:

      Old     HTML5   Change
XP    437.91  461.61   5%
Vista 678.35  742.69   9%
Mac   634.57  722.96  14%
Linux 500.29  522.45   4%
CPU% counters are collected on Windows, but they go both ways...

Old XP:
    * tp4: 437.91 (details)
    * tp4_%cpu: 63.83 (details)
    * tp4_pbytes: 73.8MB (details)
    * tp4_memset: 78.0MB (details)
    * tp4_shutdown: 1691.0 (details)
    
Old Vista
    * tp4: 678.35 (details)
    * tp4_%cpu: 48.26 (details)
    * tp4_pbytes: 87.0MB (details)
    * tp4_memset: 72.4MB (details)
    * tp4_shutdown: 16509.0 (details)

HTML5 XP
    * tp4: 461.61 (details)
    * tp4_%cpu: 57.81 (details)
    * tp4_pbytes: 75.9MB (details)
    * tp4_memset: 80.4MB (details)
    * tp4_shutdown: 1715.0 (details)

HTML5 Vista
    * tp4: 742.69 (details)
    * tp4_%cpu: 49.42 (details)
    * tp4_pbytes: 84.9MB (details)
    * tp4_memset: 73.8MB (details)
    * tp4_shutdown: 13421.0 (details)
I did an analysis of the per-page numbers. Observations:
 * The HTML5 parser has some ridiculously bad outlier runs, especially on Vista.
 * Looking at *medians*, there are 8 pages that are faster with the HTML5 parser on all four platforms, 18 that are slower with the HTML5 parser on all four platforms and 74 pages that are faster *or* slower depending on the platform.
 * Mac is overall slower even in medians.
Profiled tp4 with Shark:
 * The parser thread takes 0.9% of the time.
 * _handleWindowNeedsDisplay takes notably more time 11.3% vs. 6.9%
 * ProcessGeckoEvents takes a little more time 7.3% vs. 6.8%
 * nsPreloadURIs::Run takes less time than the HTML5 parser's speculative runnables.
 * PresShell::ReflowEvent and DispatchContentLoadedEvents take more time with the HTML5 parser enabled.
 * InputStreamReadyEvent does more stuff in with the HTML5 parser. Particularly for imgrequests.
Main thread also serving as the parser thread:

Linux
    * tp4: 515.93 (details)
    * tp4_pbytes: 138.5MB (details)
    * tp4_rss: 45.9MB (details)
    * tp4_shutdown: 889.0 (details)
    
XP
    * tp4: 454.03 (details)
    * tp4_%cpu: 55.71 (details)
    * tp4_pbytes: 75.5MB (details)
    * tp4_memset: 81.2MB (details)
    * tp4_shutdown: 2087.0 (details)
    
Vista
    * tp4: 753.73 (details)
    * tp4_%cpu: 49.72 (details)
    * tp4_pbytes: 85.5MB (details)
    * tp4_memset: 74.7MB (details)
    * tp4_shutdown: 12051.0 (details)
    
Mac
    * tp4: 797.48 (details)
    * tp4_pbytes: 460.5MB (details)
    * tp4_rss: 132.6MB (details)
    * tp4_shutdown: 1234.0 (details)
Setting responsiveness timers to large values:

Linux
    * tp4: 519.27 (details)
    * tp4_pbytes: 136.3MB (details)
    * tp4_rss: 45.8MB (details)
    * tp4_shutdown: 956.0 (details)

Mac
    * tp4: 722.51 (details)
    * tp4_pbytes: 720.1MB (details)
    * tp4_rss: 134.7MB (details)
    * tp4_shutdown: 1105.0 (details)

XP
    * tp4: 455.56 (details)
    * tp4_%cpu: 57.0 (details)
    * tp4_pbytes: 76.9MB (details)
    * tp4_memset: 81.8MB (details)
    * tp4_shutdown: 1573.0 (details)

Vista
    * tp4: 883.42 (details)
    * tp4_%cpu: 44.1 (details)
    * tp4_pbytes: 85.5MB (details)
    * tp4_memset: 77.4MB (details)
    * tp4_shutdown: 20914.0 (details)
Avoiding flushing append notifications form controls

Linux
    * tp4: 521.01 (details)
    * tp4_pbytes: 134.4MB (details)
    * tp4_rss: 44.7MB (details)
    * tp4_shutdown: 834.0 (details)

XP
    * tp4: 459.08 (details)
    * tp4_%cpu: 59.22 (details)
    * tp4_pbytes: 78.2MB (details)
    * tp4_memset: 82.7MB (details)
    * tp4_shutdown: 2620.0 (details)

Vista
    * tp4: 781.15 (details)
    * tp4_%cpu: 47.0 (details)
    * tp4_pbytes: 85.2MB (details)
    * tp4_memset: 74.6MB (details)
    * tp4_shutdown: 9894.0 (details)

Mac
    * tp4: 721.6 (details)
    * tp4_pbytes: 710.8MB (details)
    * tp4_rss: 135.2MB (details)
    * tp4_shutdown: 871.0 (details)
No metascan

Linux
    * tp4: 505.73 (details)
    * tp4_pbytes: 136.0MB (details)
    * tp4_rss: 46.1MB (details)
    * tp4_shutdown: 929.0 (details)

XP
    * tp4: 461.44 (details)
    * tp4_%cpu: 55.36 (details)
    * tp4_pbytes: 75.5MB (details)
    * tp4_memset: 82.1MB (details)
    * tp4_shutdown: 1572.0 (details)

Vista
    * tp4: 803.3 (details)
    * tp4_%cpu: 48.31 (details)
    * tp4_pbytes: 83.1MB (details)
    * tp4_memset: 77.7MB (details)
    * tp4_shutdown: 19504.0 (details)

Mac
    * tp4: 706.59 (details)
    * tp4_pbytes: 694.3MB (details)
    * tp4_rss: 134.8MB (details)
    * tp4_shutdown: 1061.0 (details)
With the patch from bug 538087 applied:

Linux
    * tp4: 490.74 (details)
    * tp4_pbytes: 136.6MB (details)
    * tp4_rss: 46.3MB (details)
    * tp4_shutdown: 1037.0 (details)

XP
    * tp4: 432.29 (details)
    * tp4_%cpu: 54.83 (details)
    * tp4_pbytes: 76.5MB (details)
    * tp4_memset: 81.2MB (details)
    * tp4_shutdown: 1939.0 (details)

Vista
    * tp4: 816.84 (details)
    * tp4_%cpu: 46.47 (details)
    * tp4_pbytes: 88.5MB (details)
    * tp4_memset: 82.6MB (details)
    * tp4_shutdown: 8564.0 (details)

Mac
    * tp4: 688.91 (details)
    * tp4_pbytes: 712.4MB (details)
    * tp4_rss: 136.2MB (details)
    * tp4_shutdown: 963.0 (details)

Looks really good on Linux and XP and weird on Vista.
Depends on: 538087
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Created attachment 429116 [details] [diff] [review]
Instead of returning to the event loop after a </script>, loop within the treeop executor

Notes:
 * The code still posts a runnable after each </script> albeit a different runnable. This seems to be sufficient to make the parser never slip to doom because one of the runnables returns early. It isn't clear to me what would be strictly necessary to avoid letting the parser slip to doom (returning to the event loop before the parser is done but without anything ever calling to the parser again).

 * The current value for the deadline pref in all.js is deliberately over the top. It needs tuning.

 * As previously discussed, varying the deadline depending on the recentness of a user event is left to a follow-up patch.
Comment on attachment 429116 [details] [diff] [review]
Instead of returning to the event loop after a </script>, loop within the treeop executor

Please pretend that the deadline pref is set to 750 instead of 4000.

This patch unregresses tp4 on Vista and Linux, but not on Mac. (XP not available on the tryserver.)

I'm running out of ideas on the Mac. I'll reprofile in Shark.
Attachment #429116 - Flags: review?(bnewman)
Comment on attachment 429116 [details] [diff] [review]
Instead of returning to the event loop after a </script>, loop within the treeop executor

I'll try something else still.
Attachment #429116 - Flags: review?(bnewman)
Created attachment 429524 [details] [diff] [review]
Loop within the treeop executor, use the timing machinery from nsContentSink

Known problem: The "deflection count" of 200 is tuned for number of tokens with the old parser. Now it's used for the number of tree ops. Setting it to 100 or so might be better.
Attachment #429116 - Attachment is obsolete: true
Created attachment 430027 [details] [diff] [review]
Loop within the treeop executor, use the timing machinery from nsContentSink, rewrite speculative loads

When speculative loads were runnables, looping within the tree op executor made it possible for the loop to execute tree ops before their corresponding speculative load runnables had run. This caused badness in test_bug503481b.html.

To make sure that speculative loads always start before the corresponding real tree ops run (even if the time distance of the two were so narrow that the speculations aren't really useful anymore), I put speculative loads into an explicit queue that always gets flushed before the op queue.

The code for the new speculative load queue is modeled on the structure of the tree op stage.
Attachment #429524 - Attachment is obsolete: true
Attachment #430027 - Flags: review?(bnewman)
Created attachment 430028 [details] [diff] [review]
Loop within the treeop executor, use the timing machinery from nsContentSink, rewrite speculative loads, fix bogus line

Oops. I had forgotten to remove a bogus line about deflections.
Attachment #430027 - Attachment is obsolete: true
Attachment #430028 - Flags: review?(bnewman)
Attachment #430027 - Flags: review?(bnewman)
Tryserver results:
Vista:
    * tp4: 661.19 (details)
    * tp4_%cpu: 50.96 (details)
    * tp4_pbytes: 95.7MB (details)
    * tp4_memset: 91.2MB (details)
    * tp4_shutdown: 6548.0 (details)

XP:
    * tp4: 423.46 (details)
    * tp4_%cpu: 59.71 (details)
    * tp4_pbytes: 77.8MB (details)
    * tp4_memset: 80.8MB (details)
    * tp4_shutdown: 1566.0 (details)

Mac:
    * tp4: 687.56 (details)
    * tp4_pbytes: 698.2MB (details)
    * tp4_rss: 134.9MB (details)
    * tp4_shutdown: 854.0 (details)

Linux:
    * tp4: 471.42 (details)
    * tp4_pbytes: 141.1MB (details)
    * tp4_rss: 45.7MB (details)
    * tp4_shutdown: 902.0 (details)

Yay. Now better than the old parser on all platforms.
Summary: [HTML5] Make the HTML5 parser not regress tp4 → [HTML5][Patch] Make the HTML5 parser not regress tp4
No longer blocks: 541040
Attachment #430028 - Flags: review?(bnewman) → review?(jonas)
Still trying to understand how the speculative loading works. Moving away from runnables scares me. What will cause us to get in to RunFlushLoop so that we call FlushLoads while we're otherwise stalled waiting for a slow external script waiting to load?

Does FlushLoads/mLoadStage/mLoadQueue etc only deal with speculative loads? If so, please name them FlushSpeculativeLoads/mSpeculativeLoadStage etc. And please add comments for these member variables to describe how they are used.

The flow is also somewhat confusingly named. With one FlushLoads function that moves ops from "queue" to "stage", and another FlushLoads function that moves them from "stage" to "queue". Simply using better names here would likely help.

Is there a reason to speculatively load manifests? First of all they are very rare today, and when they do appear, they generally should appear before any <script>s, thus before we'll ever block, right?
Ok, I think I'm understanding how the speculation is supposed to happen now. Sorry to be arguing about names, but I find that good naming tends to make it a lot easier to understand new code piecewise without having to fit it all in my head before understanding any of it.

So first to confirm that I've actually understood things:

The idea is that we have two sets of queues, the 'op' queue and the 'load' queue. Where the 'load' queue is where we stick all the requests for speculative loads.

In order to prevent first loading a resource, and then speculatively loading it (which would hit the network twice, and cause test_bug503481b.html to fail), we always transfer things in the 'load' queue to the main thread before we transfer things in the 'op' queue to the main thread. And we always process the entire 'load' queue before we transfer anything in the 'op' queue.

Is this correct?

It seems like there is one race condition which could cause us do the real load before the speculative load still. Is there a risk that writing to 'load' queue and the 'op' queue happens while the main thread is in RunFlushLoop after flushing the 'load' queue, but before flushing the 'op' queue?

I suspect the simplest thing would be to keep a hash in the script loader of all loaded scripts and make sure that we never speculatively load something that we've done a real load for. Alternatively, make sure that any time we do a real load, make sure to go through the 'load' queue and purge any pending speculative events for the same uri.
(In reply to comment #18)
> Still trying to understand how the speculative loading works. Moving away from
> runnables scares me. What will cause us to get in to RunFlushLoop so that we
> call FlushLoads while we're otherwise stalled waiting for a slow external
> script waiting to load?

There's also a dedicated runnable for calling FlushLoads only. That runnable will call FlushLoads before RunFlushLoop does when "ops" are actually stalled.

> Does FlushLoads/mLoadStage/mLoadQueue etc only deal with speculative loads? If
> so, please name them FlushSpeculativeLoads/mSpeculativeLoadStage etc. And
> please add comments for these member variables to describe how they are used.

OK. Lately, I've been erring on the side of shorter names to avoid the over 80 characters-wide lines that I get with verbose names.

> The flow is also somewhat confusingly named. With one FlushLoads function that
> moves ops from "queue" to "stage", and another FlushLoads function that moves
> them from "stage" to "queue". Simply using better names here would likely help.

In the case of both "ops" and "loads", there are three queues: one on the main thread, one on the parser thread and one in a staging area in between.

> Is there a reason to speculatively load manifests? First of all they are very
> rare today, and when they do appear, they generally should appear before any
> <script>s, thus before we'll ever block, right?

"Speculative" manifest loads aren't truly speculative--if a manifest gets loaded, we are committed to it. There can never be a <script> before the manifest, so the situation of having to undo a manifest due to document.write() never arises. The reason why a parser thread-discovered manifest gets loaded via "loads" as opposed to "ops" is that the manifest must get processed before any actually speculative loads such as scripts. Thus, manifests seen by the parser thread have to maintain the queue order relative to true speculative loads. See bug 541079.

(Aside: Fixing bug 543062 will have to involve bypassing the "ops" for manifests in document.open()ed docs, too.)

(In reply to comment #19)
> Ok, I think I'm understanding how the speculation is supposed to happen now.
> Sorry to be arguing about names, but I find that good naming tends to make it a
> lot easier to understand new code piecewise without having to fit it all in my
> head before understanding any of it.
> 
> So first to confirm that I've actually understood things:
> 
> The idea is that we have two sets of queues, the 'op' queue and the 'load'
> queue. Where the 'load' queue is where we stick all the requests for
> speculative loads.
> 
> In order to prevent first loading a resource, and then speculatively loading it
> (which would hit the network twice, and cause test_bug503481b.html to fail), we
> always transfer things in the 'load' queue to the main thread before we
> transfer things in the 'op' queue to the main thread. And we always process the
> entire 'load' queue before we transfer anything in the 'op' queue.
> 
> Is this correct?

Correct.

> It seems like there is one race condition which could cause us do the real load
> before the speculative load still. Is there a risk that writing to 'load' queue
> and the 'op' queue happens while the main thread is in RunFlushLoop after
> flushing the 'load' queue, but before flushing the 'op' queue?

Looks like there is in the case where the parser thread is not parsing ahead after a </script> but is transferring data to the main thread through the op stage. Oops...
 
> I suspect the simplest thing would be to keep a hash in the script loader of
> all loaded scripts and make sure that we never speculatively load something
> that we've done a real load for. Alternatively, make sure that any time we do a
> real load, make sure to go through the 'load' queue and purge any pending
> speculative events for the same uri.

It seems to me a simpler fix would be unifying the load stage and the op stage into one object with one mutex and making the main thread read a single method for the case where ops are read from stage.
One more thing:
> In the case of both "ops" and "loads", there are three queues: one on the main
> thread, one on the parser thread and one in a staging area in between.

For "ops" the staging area in only used when the parser thread is transferring data to the main thread without setting it aside in a speculation. When a speculation has been set aside or when processing document.write()-created ops, there's no point in using the staging area, so it's not used in those cases.
I would fairly strongly err on the side of descriptive names and fewer abstractions in order to make it more understandable what the code is doing by simply looking at it. So for example I'd also prefer to see the nsHtml5SpeculativeLoadStage class go away and have the treebuilder hold a strong reference to the nsHtml5TreeOpExecutor directly and have a lock living there which you explicitly grab. No need to go through a level of abstraction IMHO.

(In reply to comment #20)
> In the case of both "ops" and "loads", there are three queues: one on the main
> thread, one on the parser thread and one in a staging area in between.

I think more descriptive names, as well as comments where the members are declared, would help with understanding this a lot.

> "Speculative" manifest loads aren't truly speculative--if a manifest gets
> loaded, we are committed to it. There can never be a <script> before the
> manifest, so the situation of having to undo a manifest due to document.write()
> never arises. The reason why a parser thread-discovered manifest gets loaded
> via "loads" as opposed to "ops" is that the manifest must get processed before
> any actually speculative loads such as scripts. Thus, manifests seen by the
> parser thread have to maintain the queue order relative to true speculative
> loads. See bug 541079.

Please document this thoroughly given that it's somewhat unexpected usage of the speculative queues.

> It seems to me a simpler fix would be unifying the load stage and the op stage
> into one object with one mutex and making the main thread read a single method
> for the case where ops are read from stage.

That would be ok too. Though eventually we'll probably want to use lock-less queues to transfer these objects from the parser thread to the main thread. But we can worry about that later.
(In reply to comment #22)
> I would fairly strongly err on the side of descriptive names and fewer
> abstractions in order to make it more understandable what the code is doing by
> simply looking at it. 

OK.

> So for example I'd also prefer to see the
> nsHtml5SpeculativeLoadStage class go away and have the treebuilder hold a
> strong reference to the nsHtml5TreeOpExecutor directly and have a lock living
> there which you explicitly grab. No need to go through a level of abstraction
> IMHO.

Surely for the purposes of this review, nsHtml5SpeculativeLoadStage can stay since this patch doesn't introduce it and its existence has been previously reviewed? The purpose of nsHtml5SpeculativeLoadStage is to avoid per-tree op locking on one hand and on the other hand make it obvious which members are protected by the mutex. Furthermore, since nsHtml5TreeOpExecutor is itself a lockless tree op sink, there needs to be another object that is the lock-protected tree op sink.

> (In reply to comment #20)
> > "Speculative" manifest loads aren't truly speculative--if a manifest gets
> > loaded, we are committed to it. There can never be a <script> before the
> > manifest, so the situation of having to undo a manifest due to document.write()
> > never arises. The reason why a parser thread-discovered manifest gets loaded
> > via "loads" as opposed to "ops" is that the manifest must get processed before
> > any actually speculative loads such as scripts. Thus, manifests seen by the
> > parser thread have to maintain the queue order relative to true speculative
> > loads. See bug 541079.
> 
> Please document this thoroughly given that it's somewhat unexpected usage of
> the speculative queues.

OK.
 
> > It seems to me a simpler fix would be unifying the load stage and the op stage
> > into one object with one mutex and making the main thread read a single method
> > for the case where ops are read from stage.
> 
> That would be ok too.

I'll fix it that way.
Attachment #430028 - Attachment is obsolete: true
Attachment #430028 - Flags: review?(jonas)
Created attachment 432779 [details] [diff] [review]
Address comments from sicking
Attachment #432779 - Flags: review?(bnewman)
(In reply to comment #24)
> Created an attachment (id=432779) [details]
> Address comments from sicking

My usual technique for getting a reliable interdiff (apply one patch, use patch -R to reverse it, then use patch to apply the other patch, then hg diff) doesn't seem to be producing correct results, probably due to bit-rot.

Would it be possible for you to post an interdiff, Henri?
Created attachment 433886 [details] [diff] [review]
Interdiff for addressing sicking's comments

Here's the interdiff.
Comment on attachment 432779 [details] [diff] [review]
Address comments from sicking

These changes look sound to me, and appear to address the points raised in the discussion above.

My only suggestion is that you stick to the same style between MoveOpsAndSpeculativeLoadsTo, MoveSpeculativeLoadsFrom, and MoveSpeculativeLoadsTo with respect to early return vs. else clause:

diff --git a/parser/html/nsHtml5TreeOpStage.cpp b/parser/html/nsHtml5TreeOpStage.cpp
--- a/parser/html/nsHtml5TreeOpStage.cpp
+++ b/parser/html/nsHtml5TreeOpStage.cpp
@@ -79,22 +79,22 @@ nsHtml5TreeOpStage::MoveSpeculativeLoads
 {
   mozilla::MutexAutoLock autoLock(mMutex);
   if (mSpeculativeLoadQueue.IsEmpty()) {
     mSpeculativeLoadQueue.SwapElements(aSpeculativeLoadQueue);
-    return;
+  } else {
+    mSpeculativeLoadQueue.MoveElementsFrom(aSpeculativeLoadQueue);
   }
-  mSpeculativeLoadQueue.MoveElementsFrom(aSpeculativeLoadQueue);
 }
 
 void
 nsHtml5TreeOpStage::MoveSpeculativeLoadsTo(nsTArray<nsHtml5SpeculativeLoad>& aSpeculativeLoadQueue)
 {
   mozilla::MutexAutoLock autoLock(mMutex);
   if (aSpeculativeLoadQueue.IsEmpty()) {
     mSpeculativeLoadQueue.SwapElements(aSpeculativeLoadQueue);
-    return;
+  } else {
+    aSpeculativeLoadQueue.MoveElementsFrom(mSpeculativeLoadQueue);
   }
-  aSpeculativeLoadQueue.MoveElementsFrom(mSpeculativeLoadQueue);
 }
 
 #ifdef DEBUG
 void
Attachment #432779 - Flags: review?(bnewman) → review+
(In reply to comment #27)
> (From update of attachment 432779 [details] [diff] [review])
> These changes look sound to me, and appear to address the points raised in the
> discussion above.
> 
> My only suggestion is that you stick to the same style between
> MoveOpsAndSpeculativeLoadsTo, MoveSpeculativeLoadsFrom, and
> MoveSpeculativeLoadsTo with respect to early return vs. else clause:

Thanks!

Pushed with 'else' instead of early 'return' in nsHtml5TreeOpStage.cpp:
http://hg.mozilla.org/mozilla-central/rev/61fb9e7374eb
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED

Updated

8 years ago
Keywords: perf
You need to log in before you can comment on or make changes to this bug.