Closed Bug 1395652 Opened 7 years ago Closed 7 years ago

Find out which changes introduced in bug 1358060 cause regressions themselves

Categories

(Core :: Networking: HTTP, defect, P1)


Tracking


RESOLVED INVALID
Tracking Status
firefox57 --- affected

People

(Reporter: mayhemer, Assigned: xeonchen)

References

Details

To explain: bug 1358060 introduces a pref'ed feature to tail tracker requests (put them into a special bucket with very low priority within the page load context). There were also a few changes I found (via local testing) to be important for the feature to work correctly and most efficiently. However, WPT/presto testing revealed that these changes themselves cause some load time regressions (when the tailing feature is otherwise pref'ed off). (To avoid these regressions in the wild, in case we would want to disable tailing with the pref, bug 1395525 has been filed and patched - it guards those changes behind the tail-enable pref.)

We should invest some effort into finding out which of the changes are behind those WPT/presto regressions and update them (revert them, or diagnose the real cause of the regression and do a proper fix). The 5 changes in question are listed in the bug 1395525 patch. The idea is to go one by one (or bisect) and force-enable them (s/nsContentUtils::IsLowerNetworkPriority()/true/) while tailing itself is preffed off ("network.http.tailing.enabled" at false), and run the same WPT/presto set of tests on each such setup.

Note that tailing on has a very positive effect on a few sites like cnn and twitter (according to the presto test, at least). If we fix the regressions it might lead to further improvements of the tail feature's effect. But that is only to tempt you ;)

I'm tentatively giving this to Gary.
Depends on: tailing
Whiteboard: [necko-active]
After re-testing with mitmproxy enabled, I can see the metric "Time to Start Render" has a significant improvement (50-run avg.). Twitter in particular: the tailing-off/on differences to baseline go from -17%/-16% to 24%/20%, and Flickr goes from -8%/-16% to 0%/7%. Some measurements also show regressions, such as CNN with tailing on and NYTimes with tailing off.

Direct Internet

 #  URL      baseline   tailing-off        tailing-on
 1  cnn      2709.80    2859.43   -6%     2656.30    2%
 2  imgur    1382.53    1453.77   -5%     1523.73  -10%
 3  ebay     2432.17    2509.23   -3%     2383.50    2%
 4  500px    3298.40    3262.02    1%     3351.42   -2%
 5  twitter  2248.44    2620.00  -17%     2602.92  -16%
 6  flickr   1583.72    1705.34   -8%     1837.07  -16%
 7  nytimes  1607.70    1623.56   -1%     1542.71    4%
 8  wired     827.98     831.48    0%      814.62    2%

MITM proxy

 #  URL      baseline   tailing-off        tailing-on
 1  cnn      2192.54    2180.06    1%     2574.38  -17%
 2  imgur     734.33     724.38    1%      743.22   -1%
 3  ebay     2897.35    2546.78   12%     2495.38   14%
 4  500px    2582.71    2589.22    0%     2564.04    1%
 5  twitter  3523.86    2688.94   24%     2819.38   20%
 6  flickr    811.86     811.20    0%      752.35    7%
 7  nytimes  2021.48    2226.08  -10%     1991.84    1%
 8  wired    1073.02    1106.62   -3%     1083.00   -1%
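A minimal sketch of how the percentage columns above appear to be computed, assuming the delta is (baseline - measured) / baseline so that a positive number means faster than baseline; the actual presto tooling may compute it differently:

# Illustrative only: how the deltas in the tables above appear to be derived.
def delta_vs_baseline(baseline_ms, measured_ms):
    """Relative improvement of measured_ms over baseline_ms (positive = faster)."""
    return (baseline_ms - measured_ms) / baseline_ms

# Example: cnn over direct internet, tailing off (matches the -6% above).
print(f"{delta_vs_baseline(2709.80, 2859.43):+.0%}")  # -> -6%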
(In reply to Honza Bambas (:mayhemer) [off till 18.9. inc] from comment #0)
> The idea is to go one by one or bisect) and force-enable them
> (s/nsContentUtils::IsLowerNetworkPriority()/true/) while tailing itself is

I guess this is a typo of "nsContentUtils::IsTailingEnabled()", because there's only one occurrence of nsContentUtils::IsLowerNetworkPriority().

> preffed off ("network.http.tailing.enabled" at false) and run the same
> WPT/presto set of tests on each such setup.

Pushed [1-5]

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cdd72befbeca0aa0a91b0889fed6a70d4b6a304
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=dfe2da58cb69ca95b972111f02ef417a4497047d
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=6921770c93f0db08a86708416371695e3174adf1
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2a469db9f69b16021721e1c9c368bc3d7cfd753f
[5] https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf05497718832fb8239483573bb1a70437992dda
Based on the experiment described in comment 2, XMLHttpRequestMainThread shows the most significant improvement among all 5 occurrences (the percentages appear to be relative to the FetchDriver column).

 #  URL      FetchDriver   FaviconHelpers   ScriptLoader    XMLHttpRequestMainThread   nsChannelClassifier
 1  cnn         2213.5     2135.0    4%     2213.0    0%    2134.0    4%               2213.0    0%
 2  imgur        715.0      716.0    0%      715.0    0%     715.0    0%                716.0    0%
 3  ebay        4238.5     4200.5    1%     4315.5   -2%    4154.0    2%               4125.0    3%
 4  500px       2614.0     2614.0    0%     2614.0    0%    2213.0   15%               2613.0    0%
 5  twitter     2731.0     2814.0   -3%     2813.5   -3%    2616.0    4%               2813.0   -3%
 6  flickr       515.0      515.0    0%      516.0    0%     515.0    0%                515.0    0%
 7  nytimes     2323.5     2173.5    6%     1964.5   15%    1664.5   28%               1865.5   20%
 8  wired       1113.0     1015.0    9%     1016.0    9%    1016.0    9%               1021.0    8%
Priority: P3 → P1
This is the ongoing performance investigation work for 57, keeping it as P1.
Whiteboard: [necko-active]
This is the 2nd try through the MITM proxy, 50-run average of "time to start rendering".

 #  URL      All-OFF   All-ON          FaviconHelpers  FetchDriver     ScriptLoader    XMLHttpRequestMainThread  nsChannelClassifier
 1  cnn      2217.50   2221.82   0%    2244.80  -1%    2229.64   1%    2179.64   2%    2263.29  -2%              2224.92   0%
 2  imgur    1725.86   1681.50   3%     893.55  48%    1714.74   1%    1661.40   4%    1700.06   1%               954.20  45%
 3  ebay     4927.98   4886.34   1%    4005.12  19%    4044.94  18%    4030.66  18%    3993.82  19%              5127.68  -4%
 4  500px    2613.68   2606.68   0%    2666.82  -2%    2571.62   2%    2603.56   0%    2590.24   1%              2691.38  -3%
 5  twitter  2767.56   2666.54   4%    2803.36  -1%    2827.82  -2%    2823.20  -2%    2813.54  -2%              2763.71   0%
 6  flickr    662.50    624.80   6%     708.22  -7%     638.36   4%     764.24 -15%     744.84 -12%               639.46   3%
 7  nytimes  2036.85   2007.59   1%    2106.56  -3%    2145.82  -5%    2107.28  -3%    2184.38  -7%              1898.24   7%
 8  wired    1029.65   1020.81   1%    1050.60  -2%    1096.30  -6%    1062.92  -3%    1063.60  -3%              1067.84  -4%
3rd try, direct internet, 50-run average of "time to start rendering".

 #  URL      All-OFF   All-ON          FaviconHelpers  FetchDriver     ScriptLoader    XMLHttpRequestMainThread  nsChannelClassifier
 1  cnn      2614.56   2562.57   2%    2560.16   2%    2572.24   2%    2674.98  -2%    2757.63  -5%              2601.98   0%
 2  imgur    2553.69   2578.38  -1%    2515.74   1%    2557.12   0%    3000.39 -17%    2537.42   1%              2513.24   2%
 3  ebay     3931.53   4115.12  -5%    3882.69   1%    4133.16  -5%    3881.44   1%    3838.90   2%              3790.44   4%
 4  500px    3230.69   3493.40  -8%    3382.52  -5%    3683.43 -14%    3252.00  -1%    3269.55  -1%              3323.65  -3%
 5  twitter  2326.10   2632.88 -13%    2573.79 -11%    2192.92   6%    2545.82  -9%    2539.84  -9%              2477.46  -7%
 6  flickr   1563.06   1460.16   7%    1866.71 -19%    1764.65 -13%    1351.40  14%    1586.74  -2%              1556.00   0%
 7  nytimes  1824.46   1997.86 -10%    1699.30   7%    1620.28  11%    1606.88  12%    1538.63  16%              1561.91  14%
 8  wired     719.06    722.57   0%     707.82   2%     714.06   1%     702.70   2%     676.54   6%               714.90   1%
Hi Honza, the results are quite unstable, and they seem to depend more on the site than on the patch. Any thoughts?
Flags: needinfo?(honzab.moz)
The FaviconHelpers bit should not have any effect, since that code is usually triggered after the document has fully loaded. The FetchDriver bit will likely not have any effect either; I don't expect any of the sites use the fetch API enough (if at all) to produce a -14% regression (500px) or an 11% improvement (nytimes) when throttled and tailed. Since all the other changes have similar noise, I don't think we can come to a conclusion.

Could we do two or three loops of exactly the same build (and prefs, let's say all-off) and see how the results differ? Maybe there will be the same noise as well (a minimal sketch of that comparison follows below).

Ideas:
1. I'd try to reduce the number of the tests (All-ON, FaviconHelpers and FetchDriver can be removed for now).
2. Run the tests interleaved (All-OFF, ScriptLoader, XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds run at more or less the same time.
3. Probably forget about MITM, since you lose h2 (mainly on twitter), and probably more sites and 3rd-party content (that significantly changes the prioritization schemes).

This might somewhat reduce the noise.
Flags: needinfo?(honzab.moz) → needinfo?(xeonchen)
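A minimal sketch of the "same build, several loops" comparison suggested above; the sample numbers are hypothetical placeholders, and real data would come from the WPT/presto exports:

import statistics

# Hypothetical load times (ms) from two runs of the *same* build and prefs.
# If the spread between these two runs is comparable to the deltas measured
# between builds, those deltas are probably just noise.
run_a = [2217.5, 2301.0, 2189.4, 2250.8, 2274.1]
run_b = [2263.3, 2198.7, 2310.2, 2241.9, 2220.5]

for name, run in (("run_a", run_a), ("run_b", run_b)):
    med = statistics.median(run)
    cv = statistics.stdev(run) / statistics.mean(run)
    print(f"{name}: median={med:.1f} ms, coefficient of variation={cv:.1%}")

# Relative difference of the medians between the two identical runs:
noise = abs(statistics.median(run_a) - statistics.median(run_b)) / statistics.median(run_a)
print(f"median-to-median noise: {noise:.1%}")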
I'm now looking at http://presto.xeon.tw/ and I'm not sure what I should compare the 0907 tests to. Where are the "no patch" runs (for the base changeset)?
I honestly don't think the test results from WPT at http://presto.xeon.tw/ can be taken seriously unless we somehow filter them. There apparently is a lot of noise and, as I like to call it, "the weather effect". When looking at the plot graphs of each of the test runs, there are clearly some extremes (one super slow test) and also apparently spans where e.g. test runs #30 to #40 are consistently slower (someone using the same line in the office to download a large file during that period?).

I was looking mainly at the twitter.com/badger test results (comparing 0831-TAILING-NO-PATCH-TALOS, 0831-TAILING-OFF-TALOS, 0831-TAILING-ON-TALOS, 0907-TAILING-FORCE-ON FetchDriver - this one is magically the fastest of them all, according to the calculated median). When looking at the plots directly, there is clearly NO MAJOR DIFFERENCE at all between all of those runs. It's both good and bad - we don't regress, but we also can't say the tailing patch is a win. I think the same likely applies to the "CDP Test" at http://10.247.28.241:8888, but it's hard to say.

It would be good to be able to switch the presto-like graphs to linear plotting, as WPT does, to see whether there are consistent external influences or not. A classical histogram view with two results overlaid would probably be good to have as well (a sketch of both views follows below). I think the interleaving idea might help to at least somewhat mitigate the external influence and the "weather" effect, for both WPT and CDP tests.

Gary, the ni? here still applies, I'd like to know the answers, thanks.
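A minimal sketch of the two views mentioned above (linear per-run plot plus overlaid histograms) using matplotlib; the sample numbers are made up, and the real values would be parsed from the WPT/presto exports:

import matplotlib.pyplot as plt

# Hypothetical "time to start render" samples (ms) per test run for two builds.
baseline = [2710, 2652, 2803, 2691, 2745, 2688, 2720, 2950, 2675, 2702]
tailing_on = [2656, 2601, 2712, 2640, 2699, 2630, 2668, 2890, 2622, 2655]

fig, (ax_runs, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Linear plot over run index: consistent slow spans here would point at an
# external ("weather") influence rather than a difference between builds.
ax_runs.plot(baseline, label="baseline")
ax_runs.plot(tailing_on, label="tailing on")
ax_runs.set_xlabel("run #")
ax_runs.set_ylabel("time to start render (ms)")
ax_runs.legend()

# Classic histogram with the two result sets overlaid.
ax_hist.hist(baseline, bins=15, alpha=0.5, label="baseline")
ax_hist.hist(tailing_on, bins=15, alpha=0.5, label="tailing on")
ax_hist.set_xlabel("time to start render (ms)")
ax_hist.legend()

fig.tight_layout()
fig.savefig("run_comparison.png")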
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
(Assuming there is no regression, only test noise.)
Resolution: WONTFIX → INVALID
(In reply to Honza Bambas (:mayhemer) from comment #11)
> Gary, the ni? here still applies, I'd like to know the answers, thanks.

Sorry for the late reply, but I still have some jobs in hand, so I'll check this later.
(In reply to Honza Bambas (:mayhemer) from comment #10)
> I'm now looking at http://presto.xeon.tw/ and I'm not sure what I should
> compare the 0907 tests to. Where are the "no patch" runs (for the base
> changeset)?

You can compare to the 0831-* builds (the ones without -mitm).
(In reply to Honza Bambas (:mayhemer) from comment #9)
> Ideas:
> 2. Run the tests interleaved (All-OFF, ScriptLoader,
> XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> run at more or less the same time

The tests are already interleaved, i.e. it ran the first URL for all browsers, then the second URL for all browsers, ...etc.
Or do I misunderstand?
Flags: needinfo?(xeonchen) → needinfo?(honzab.moz)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #15)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > Ideas:
> > 2. Run the tests interleaved (All-OFF, ScriptLoader,
> > XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> > run at more or less the same time
>
> The tests are already interleaved, i.e. it ran the first URL for all
> browsers, then the second URL for all browsers, ...etc.
> Or do I misunderstand?

Sorry for the late reply. This is not what I mean.

Let's have 3 URLs (url1, 2, 3) and 2 different builds (build1, build2) to run. The order should IMO be as follows:

first: build1 + url1, build2 + url1, build1 + url1, build2 + url1, and so on 50 times...,
then: build1 + url2, build2 + url2, build1 + url2, build2 + url2, etc.

In other words, loop builds 1 and 2 over a single URL (url1) 50 times, then loop both builds over url2, and then over url3.

Is it clear now?
Flags: needinfo?(honzab.moz) → needinfo?(xeonchen)
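To make that ordering concrete, a minimal sketch of the scheduling described above; the build and URL names are placeholders, and run_once() is a hypothetical stand-in for whatever the harness does for a single measurement:

# Placeholder identifiers; in the real harness these would be the actual
# builds (e.g. All-OFF, ScriptLoader, XMLHttpRequestMainThread,
# nsChannelClassifier) and the test URLs.
builds = ["build1", "build2"]
urls = ["url1", "url2", "url3"]
REPEATS = 50

def run_once(build, url):
    # Hypothetical stand-in for "launch this build, load this URL once,
    # record the metric".
    print(f"run {build} against {url}")

# For each URL, alternate between the builds REPEATS times before moving on
# to the next URL, so all builds see roughly the same network conditions.
for url in urls:
    for _ in range(REPEATS):
        for build in builds:
            run_once(build, url)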
(In reply to Honza Bambas (:mayhemer) from comment #16)
> (In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #15)
> > (In reply to Honza Bambas (:mayhemer) from comment #9)
> > > Ideas:
> > > 2. Run the tests interleaved (All-OFF, ScriptLoader,
> > > XMLHttpRequestMainThread, nsChannelClassifier) 50 times so that all builds
> > > run at more or less the same time
> >
> > The tests are already interleaved, i.e. it ran the first URL for all
> > browsers, then the second URL for all browsers, ...etc.
> > Or do I misunderstand?
>
> Sorry for the late reply. This is not what I mean.
>
> Let's have 3 URLs (url1, 2, 3) and 2 different builds (build1, build2) to
> run. The order should IMO be as follows:
>
> first: build1 + url1, build2 + url1, build1 + url1, build2 + url1, and so on
> 50 times...,
> then: build1 + url2, build2 + url2, build1 + url2, build2 + url2, etc.
>
> In other words, loop builds 1 and 2 over a single URL (url1) 50 times, then
> loop both builds over url2, and then over url3.
>
> Is it clear now?

OK, but if we are going to use this sequence, the WPT results can't be grouped together. I mean, if I trigger a (build, url) pair and ask WPT to run it 50 times, it generates one result. But if I trigger the (build, url) pair to run once, and trigger it 50 times, it generates 50 results.

It's doable, but it makes the system more complicated. My current design stores only one record per (build, url) pair in the database, which makes it easier and much faster to re-run tests or collect results.
Flags: needinfo?(xeonchen) → needinfo?(honzab.moz)
(In reply to Gary Chen [:xeonchen] (needinfo plz) from comment #17)
> It's doable, but it makes the system more complicated.

I don't expect otherwise ;)

Before we start digging in that direction, could we somehow confirm it's worth the effort? I know my suggestion is a mere guess from looking at the results. Have we identified the causes of this kind of noise more scientifically? Are there other ways to mitigate/minimize it than overhauling how the tests are split/grouped, with an uncertain result?
Flags: needinfo?(honzab.moz)
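One lightweight way to check whether a measured delta rises above the run-to-run noise before restructuring the harness (an illustration with made-up sample numbers, not something agreed on in this bug) is a non-parametric significance test over the raw per-run samples:

from scipy import stats

# Hypothetical per-run "time to start render" samples (ms) for two
# configurations; real values would come from the WPT/presto exports.
all_off = [2218, 2301, 2189, 2251, 2274, 2243, 2290, 2212, 2266, 2230]
xhr_forced = [2180, 2263, 2150, 2210, 2240, 2199, 2255, 2170, 2228, 2192]

# Mann-Whitney U test: does one configuration tend to load faster than the
# other, or are the two sample sets indistinguishable?
stat, p_value = stats.mannwhitneyu(all_off, xhr_forced, alternative="two-sided")
print(f"p-value = {p_value:.3f}")
# A large p-value means the observed delta is within run-to-run noise and
# probably not worth chasing with a bigger test harness.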