Closed Bug 1395652 Opened 2 years ago Closed 2 years ago

Find out which changes introduced in bug 1358060 cause regressions themselves

Categories

(Core :: Networking: HTTP, defect, P1)


Tracking


RESOLVED INVALID
Tracking Status
firefox57 --- affected

People

(Reporter: mayhemer, Assigned: xeonchen)

References

(Blocks 1 open bug)

Details

To explain: bug 1358060 introduces a pref'ed feature to tail tracker requests (put them into a special bucket with very low priority within the page load context).

There were also a few changes I found (via local testing) to be important for the feature to work correctly and most efficiently.

However, WPT/presto testing revealed that the changes themselves cause some load time regressions (when the tailing feature is otherwise pref'ed off).

(To avoid these regressions in the wild, in case we want to disable tailing with the pref, bug 1395525 has been filed and patched - it guards those changes with the tailing-enabled pref.)


We should invest some effort into finding out which of the changes are behind those WPT/presto regressions and update them (revert them, or diagnose the real cause of the regression and do a proper fix).

The 5 changes in question are listed in the bug 1395525 patch.

The idea is to go one by one (or bisect) and force-enable them (s/nsContentUtils::IsLowerNetworkPriority()/true/) while tailing itself is preffed off ("network.http.tailing.enabled" at false) and run the same WPT/presto set of tests on each such setup.
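Concretely, the matrix I have in mind is roughly this (a Python sketch only; run_wpt_presto stands in for whatever actually drives WPT/presto, and the five names are the call sites from the bug 1395525 patch):

# Hypothetical sketch of the test matrix: one build per force-enabled call site
# (s/nsContentUtils::IsLowerNetworkPriority()/true/ at that site only), with
# tailing itself kept off via the pref, compared against an unmodified build.
FORCE_ENABLED_SITES = [
    "FetchDriver",
    "FaviconHelpers",
    "ScriptLoader",
    "XMLHttpRequestMainThread",
    "nsChannelClassifier",
]
PREFS = {"network.http.tailing.enabled": False}
URLS = ["cnn", "imgur", "ebay", "500px", "twitter", "flickr", "nytimes", "wired"]

def run_matrix(run_wpt_presto, runs=50):
    """run_wpt_presto(build, url, prefs) -> time to start render (ms); placeholder."""
    results = {}
    for build in ["baseline"] + FORCE_ENABLED_SITES:
        for url in URLS:
            samples = [run_wpt_presto(build, url, PREFS) for _ in range(runs)]
            results[(build, url)] = sum(samples) / len(samples)
    return results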

Note that tailing turned on has a very positive effect on a few sites like cnn and twitter (according to the presto test, at least).  If we fix the regressions, it might lead to further improvements of the tail feature's effect.  But that is only to tempt you ;)


I'm tentatively giving this to Gary.
Depends on: tailing
Whiteboard: [necko-active]
After re-testing with mitmproxy enabled, I can see the "Time to Start Render" metric has a significant improvement (50-run average).
Twitter especially: the tailing-off/on differences relative to baseline go from -17%/-16% to 24%/20%, and Flickr goes from -8%/-16% to 0%/7%.

There are also some measurements that show regressions, such as CNN with tailing on and NYTimes with tailing off.

Direct Internet
           baseline   tailing-off       tailing-on
1 cnn      2709.80    2859.43  -6%      2656.30   2%
2 imgur    1382.53    1453.77  -5%      1523.73 -10%
3 ebay     2432.17    2509.23  -3%      2383.50   2%
4 500px    3298.40    3262.02   1%      3351.42  -2%
5 twitter  2248.44    2620.00 -17%      2602.92 -16%
6 flickr   1583.72    1705.34  -8%      1837.07 -16%
7 nytimes  1607.70    1623.56  -1%      1542.71   4%
8 wired     827.98     831.48   0%       814.62   2%

MITM proxy

           baseline   tailing-off       tailing-on
1 cnn      2192.54    2180.06   1%      2574.38 -17%
2 imgur     734.33     724.38   1%       743.22  -1%
3 ebay     2897.35    2546.78  12%      2495.38  14%
4 500px    2582.71    2589.22   0%      2564.04   1%
5 twitter  3523.86    2688.94  24%      2819.38  20%
6 flickr    811.86     811.20   0%       752.35   7%
7 nytimes  2021.48    2226.08 -10%      1991.84   1%
8 wired    1073.02    1106.62  -3%      1083.00  -1%
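
For reading the tables: the percentage is the difference to baseline, positive meaning faster than baseline; roughly computed like this (sketch, inferred from the numbers above):

def pct_vs_baseline(baseline_ms, measured_ms):
    """Positive = faster than baseline (convention inferred from the tables above)."""
    return round((baseline_ms - measured_ms) / baseline_ms * 100)

# Direct-internet twitter row: baseline 2248.44, tailing-off 2620.00, tailing-on 2602.92
print(pct_vs_baseline(2248.44, 2620.00))  # -17
print(pct_vs_baseline(2248.44, 2602.92))  # -16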
(In reply to Honza Bambas (:mayhemer) [off till 18.9. inc] from comment #0)
> The idea is to go one by one or bisect) and force-enable them
> (s/nsContentUtils::IsLowerNetworkPriority()/true/) while tailing itself is

I guess this is a typo of "nsContentUtils::IsTailingEnabled()",
because there's only one occurrence of nsContentUtils::IsLowerNetworkPriority().

> preffed off ("network.http.tailing.enabled" at false) and run the same
> WPT/presto set of tests on each such setup.

Pushed [1-5]

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cdd72befbeca0aa0a91b0889fed6a70d4b6a304
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfe2da58cb69ca95b972111f02ef417a4497047d
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6921770c93f0db08a86708416371695e3174adf1
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2a469db9f69b16021721e1c9c368bc3d7cfd753f
[5] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf05497718832fb8239483573bb1a70437992dda
Based on the experiment described in comment 2, XMLHttpRequestMainThread shows the most significant improvement among all 5 occurrences.

          FetchDriver  FaviconHelpers  ScriptLoader  XMLHttpRequestMainThread  nsChannelClassifier
1 cnn     2213.5       2135.0  4%      2213.0  0%    2134.0  4%                2213.0  0%
2 imgur   715.0         716.0  0%       715.0  0%     715.0  0%                 716.0  0%
3 ebay    4238.5       4200.5  1%      4315.5 -2%    4154.0  2%                4125.0  3%
4 500px   2614.0       2614.0  0%      2614.0  0%    2213.0 15%                2613.0  0%
5 twitter 2731.0       2814.0 -3%      2813.5 -3%    2616.0  4%                2813.0 -3%
6 flickr  515.0         515.0  0%       516.0  0%     515.0  0%                 515.0  0%
7 nytimes 2323.5       2173.5  6%      1964.5 15%    1664.5 28%                1865.5 20%
8 wired   1113.0       1015.0  9%      1016.0  9%    1016.0  9%                1021.0  8%
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P3 → P1
This is the ongoing performance investigation work for 57; keeping it as P1.
Whiteboard: [necko-active]
This is the 2nd try, through the MITM proxy, 50-run average of "time to start rendering".

# URL        All-OFF   All-ON       FaviconHelpers FetchDriver   ScriptLoader   XMLHttpRequestMainThread   nsChannelClassifier
1 cnn        2217.50   2221.82 0%   2244.80  -1%   2229.64  1%   2179.64   2%   2263.29  -2%               2224.92 0%
2 imgur      1725.86   1681.50 3%   893.55   48%   1714.74  1%   1661.40   4%   1700.06   1%                954.20 45%
3 ebay       4927.98   4886.34 1%   4005.12  19%   4044.94 18%   4030.66  18%   3993.82  19%               5127.68 -4%
4 500px      2613.68   2606.68 0%   2666.82  -2%   2571.62  2%   2603.56   0%   2590.24   1%               2691.38 -3%
5 twitter    2767.56   2666.54 4%   2803.36  -1%   2827.82 -2%   2823.20  -2%   2813.54  -2%               2763.71  0%
6 flickr      662.50    624.80 6%    708.22  -7%    638.36  4%    764.24 -15%    744.84 -12%                639.46  3%
7 nytimes    2036.85   2007.59 1%   2106.56  -3%   2145.82 -5%   2107.28  -3%   2184.38  -7%               1898.24  7%
8 wired      1029.65   1020.81 1%   1050.60  -2%   1096.30 -6%   1062.92  -3%   1063.60  -3%               1067.84 -4%
3rd try, direct internet, 50-run average of "time to start rendering".

# URL        All-OFF   All-ON        FaviconHelpers FetchDriver   ScriptLoader   XMLHttpRequestMainThread   nsChannelClassifier
1 cnn        2614.56   2562.57   2%  2560.16   2%   2572.24   2%  2674.98  -2%   2757.63 -5%                2601.98  0%
2 imgur      2553.69   2578.38  -1%  2515.74   1%   2557.12   0%  3000.39 -17%   2537.42  1%                2513.24  2%
3 ebay       3931.53   4115.12  -5%  3882.69   1%   4133.16  -5%  3881.44   1%   3838.90  2%                3790.44  4%
4 500px      3230.69   3493.40  -8%  3382.52  -5%   3683.43 -14%  3252.00  -1%   3269.55 -1%                3323.65 -3%
5 twitter    2326.10   2632.88 -13%  2573.79 -11%   2192.92   6%  2545.82  -9%   2539.84 -9%                2477.46 -7%
6 flickr     1563.06   1460.16   7%  1866.71 -19%   1764.65 -13%  1351.40  14%   1586.74 -2%                1556.00  0%
7 nytimes    1824.46   1997.86 -10%  1699.30   7%   1620.28  11%  1606.88  12%   1538.63 16%                1561.91 14%
8 wired       719.06    722.57   0%   707.82   2%    714.06   1%   702.70   2%    676.54  6%                 714.90  1%
Hi Honza, the results are quite unstable, and they seem to depend more on the sites than on the patches.
Any thoughts?
Flags: needinfo?(honzab.moz)
The FaviconHelpers bit should not have any effect, since that code is usually triggered after the document has fully loaded.

The FetchDriver bit will likely not have any effect either; I don't expect any of the sites use the Fetch API heavily enough (if at all) to produce a -14% regression (500px) or an 11% improvement (nytimes) when throttled and tailed.

Since all the other changes show similar noise, I don't think we can come to a conclusion.

Could we do two or three loops of exactly the same build (and prefs, let's say all-off) and see how the results differ?  Maybe there will be the same noise as well.

Ideas:
1. I'd try to reduce the number of tests (All-ON, FaviconHelpers and FetchDriver can be removed for now)
2. Run the tests interleaved (All-OFF, ScriptLoader, XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds run at more or less the same time
3. Probably forget about MITM, since you lose h2 mainly on twitter, and probably on more sites and 3rd party content too (which significantly changes the prioritization schemes)

This might somewhat reduce the noise.
Flags: needinfo?(honzab.moz) → needinfo?(xeonchen)
I'm now looking at http://presto.xeon.tw/ and I'm not sure what I should compare the 0907 tests to.  Where are the "no patch" runs (for the base changeset)?
I honestly don't think the test results from WPT at http://presto.xeon.tw/ can be taken seriously unless we somehow filter them.  There is apparently a lot of noise, plus what I like to call "the weather effect".  Looking at the plot graphs of each of the test runs, there are clearly occasional extremes (one super slow test) and also spans where e.g. test runs #30 to #40 are consistently slower (someone using the same line in the office to download a large file during that period?).
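
For the filtering, something as simple as a trimmed mean (or plain median) per run set might do; a rough sketch only, the trim ratio is just a guess and this is not what presto currently does:

import statistics

def trimmed_mean(samples_ms, trim_ratio=0.1):
    """Drop the fastest and slowest 10% of runs (assumed ratio) before averaging,
    to blunt one-off extremes and "weather effect" spans."""
    ordered = sorted(samples_ms)
    k = int(len(ordered) * trim_ratio)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)

# e.g. with 50 runs, the 5 fastest and 5 slowest samples are discarded:
# trimmed_mean(run_samples)  vs.  statistics.median(run_samples)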

I was looking mainly at the twitter.com/badger test results (comparing 0831-TAILING-NO-PATCH-TALOS, 0831-TAILING-OFF-TALOS, 0831-TAILING-ON-TALOS, and 0907-TAILING-FORCE-ON FetchDriver - this one is magically the fastest of them all according to the calculated median).  When looking at the plots directly, there is clearly NO MAJOR DIFFERENCE at all between all of those runs.  It's both good and bad - we don't regress, but we also can't say the tailing patch is a win.


I think the same likely applies to the "CDP Test" at http://10.247.28.241:8888, but it's hard to say.  It would be good to be able to switch the presto-like graphs to linear plotting, as WPT does, to see whether there are consistent external influences or not.

Probably a classical histogram view with an overlay of two results would be good to have as well.
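Something along these lines is what I mean (matplotlib sketch with made-up placeholder numbers instead of the real presto data):

import matplotlib.pyplot as plt

# Placeholder per-run samples for two setups; in practice, 50 values each.
all_off_ms = [2217, 2301, 2189, 2250, 2275, 2310, 2198, 2263]
all_on_ms  = [2221, 2198, 2310, 2240, 2260, 2205, 2288, 2231]

plt.hist(all_off_ms, bins=20, alpha=0.5, label="All-OFF")
plt.hist(all_on_ms, bins=20, alpha=0.5, label="All-ON")
plt.xlabel("time to start render (ms)")
plt.ylabel("number of runs")
plt.legend()
plt.show()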

I think the interleaving idea might help to at least somewhat mitigate the external influence and the "weather" effect, for both WPT and CDP tests.


Gary, the ni? here still applies, I'd like to know the answers, thanks.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
(Assuming there is no regression, only test noise.)
Resolution: WONTFIX → INVALID
(In reply to Honza Bambas (:mayhemer) from comment #11)
> Gary, the ni? here still applies, I'd like to know the answers, thanks.

Sorry for the late reply; I still have some other jobs in hand, so I'll check this later.
(In reply to Honza Bambas (:mayhemer) from comment #10)
> I'm now looking at http://presto.xeon.tw/ and I'm not sure what should I
> compare the 0907 tests to.  Where are the "no patch" runs (for the base
> changeset)?

You can compare to the 0831-* builds without the -mitm suffix.
(In reply to Honza Bambas (:mayhemer) from comment #9)
> Ideas:
> 2. Run the tests interleaved (All-OFF, ScriptLoader,
> XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> runs in more or less the same time

The tests are already interleaved, i.e. it ran the first URL for all browsers, then the second URL for all browsers, ...etc.
Or do I misunderstand?
Flags: needinfo?(xeonchen) → needinfo?(honzab.moz)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #15)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > Ideas:
> > 2. Run the tests interleaved (All-OFF, ScriptLoader,
> > XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> > runs in more or less the same time
> 
> The tests are already interleaved, i.e. it ran the first URL for all
> browsers, then the second URL for all browsers, ...etc.
> Or do I misunderstand?

Sorry for late reply.  This is not what I mean.

Let's have 3 URLs (url1, 2, 3) and 2 different builds (build1, build2) to run.  The order should IMO be as follows:

first: build1 + url1, build2 + url1, build1 + url1, build2 + url1, and so on 50 times..., 
then: build1 + url2, build2 + url2, build1 + url2, build2 + url2 etc..

In other words, loop builds 1 and 2 over a single URL (url1) 50 times.  Then loop both builds over url2, and then over url3.
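
As a sketch (run() being a placeholder for one single-run WPT/presto trigger):

def run(build, url):
    """Placeholder for triggering one single-run WPT/presto test."""
    print(f"run {build} on {url}")

builds = ["build1", "build2"]
urls = ["url1", "url2", "url3"]
RUNS = 50

for url in urls:               # finish one URL completely before moving on
    for _ in range(RUNS):      # 50 passes over that URL...
        for build in builds:   # ...alternating builds within every pass
            run(build, url)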

Is it clear now?
Flags: needinfo?(honzab.moz) → needinfo?(xeonchen)
(In reply to Honza Bambas (:mayhemer) from comment #16)
> (In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #15)
> > (In reply to Honza Bambas (:mayhemer) from comment #9)
> > > Ideas:
> > > 2. Run the tests interleaved (All-OFF, ScriptLoader,
> > > XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> > > runs in more or less the same time
> > 
> > The tests are already interleaved, i.e. it ran the first URL for all
> > browsers, then the second URL for all browsers, ...etc.
> > Or do I misunderstand?
> 
> Sorry for late reply.  This is not what I mean.
> 
> Let's have 3 URLs (url1, 2, 3) and 2 different builds (build1, build2) to
> run.  The order should IMO be as follows:
> 
> first: build1 + url1, build2 + url1, build1 + url1, build2 + url1, and so on
> 50 times..., 
> then: build1 + url2, build2 + url2, build1 + url2, build2 + url2 etc..
> 
> In other words, loop builds1 and 2 over a single URL (url1) 50 times.  Then
> loop both builds over url2, and then over url3.
> 
> Is it clear now?

OK, but if we are going to use this sequence, the WPT results can't be grouped together.
I mean, if I trigger a (build, url) pair and ask WPT to run it 50 times, it generates one result.
But if I trigger a (build, url) pair to run 1 time, and trigger it 50 times, it generates 50 results.

It's doable, but it makes the system more complicated.
In my current design, every (build, url) pair is stored as only one record in the database, which makes it easier and much faster to re-run tests or collect results.
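
If I did go with single-run triggers, the per-run results could still be folded back into one record per (build, url) pair at collection time; roughly (with a hypothetical shape for the per-run records):

from collections import defaultdict
from statistics import fmean

# Hypothetical per-run records, as they might come back from 50 single-run triggers.
runs = [
    {"build": "All-OFF", "url": "twitter", "start_render_ms": 2767.6},
    {"build": "ScriptLoader", "url": "twitter", "start_render_ms": 2823.2},
    # ...one entry per (build, url, run)
]

grouped = defaultdict(list)
for r in runs:
    grouped[(r["build"], r["url"])].append(r["start_render_ms"])

# One aggregated value per (build, url), matching the current one-record-per-pair design.
records = {key: fmean(values) for key, values in grouped.items()}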
Flags: needinfo?(xeonchen) → needinfo?(honzab.moz)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #17)
> It's doable, but makes the system more complicated.

I don't expect otherwise ;)  

Before we start digging that way, could we somehow confirm it's worth the effort?  I know my suggestion is a mere guess from looking at the results.  Have we identified the causes of this kind of noise more scientifically?  Are there other ways to mitigate/minimize it than an overhaul of how the tests are split/grouped, with an uncertain result?
Flags: needinfo?(honzab.moz)