Closed Bug 982598 Opened 6 years ago Closed 5 years ago

HTTP cache v2: investigate talos regressions

Categories

(Core :: Networking: Cache, defect, major)

defect
Not set
major

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: mayhemer, Assigned: mayhemer)

References

()

Details

See URL for the current status spreadsheet.  +NN% means improovement, -NN% means regressions.  The worst is the SVG opacity.
Joel, I'd use some help from you with moving this forward.  I can investigate my self but I need some information on how to run talos tests, update the code, how to engage NSPR logging on the runs and how to interpret the results and also how to get a reference numbers to compare to.

Thanks!
Flags: needinfo?(jmaher)
my latest push with nspr logs:
https://tbpl.mozilla.org/?tree=Try&rev=0bfd96a29f51

unfortunately this is me replace tsvgr with a new latency test, so ignore the red 's' retriggers.  In fact, this doesn't help you as you want svg as that is the worse case with the v2 cache.

We could create another people based talos repo with the changes you want so my dev repo doesn't affect your investigations.

For comparing numbers, in the talos repo we have talos/compare.py, that script can compare results on try with a reference branch (i.e. inbound, firefox, fx-team).  it has come in handy for sanity checking patches.
Flags: needinfo?(jmaher)
Actually, first thing I'm interested in are two NSPR logs: one with cache2 on and with cache2 off.  I could compare the difference.  Modules I may be interested in now after I looked at some of the older logs from you are: nsHttp:4,cache2:5,cache:5 (for both cache2 on and off).  How complicated would be to have my own talos repo?
you can create a repo on a people account like this:
https://developer.mozilla.org/en-US/docs/Creating_Mercurial_User_Repositories

then you can edit talos to dump nspr (most of this patch):
http://hg.mozilla.org/users/jmaher_mozilla.com/tpain/rev/03896dc05f97

then push with talos.json from your m-c based tree:
     "global": {
-        "talos_repo": "https://hg.mozilla.org/build/talos",
-        "talos_revision": "7674ffd4b494"
+        "talos_repo": "http://hg.mozilla.org/users/jmaher_mozilla.com/tpain",
+        "talos_revision": "default"
     },


Of course change the talos_repo.  Now you are ready to test.
Joel, this doesn't produce logs for me: https://tbpl.mozilla.org/php/getParsedLog.php?id=36363502&full=1&branch=try and no nspr.log reference found, neither on tbpl.  My changes to the talos repo: https://hg.mozilla.org/users/honzab.moz_firemni.cz/cache2-talos/rev/a8009783ca63
Joel, can you please check I'm doing this correctly?  Please see https://tbpl.mozilla.org/?tree=Try&rev=3fd93c96ca3c.

I'm not getting NSPR logs and also the timing seems to be identical, so I probably do something wrong.

Thanks.
Flags: needinfo?(jmaher)
odd, I have the same code and I see your problem.  I see nspr logs for linux, but not for windows.  Is there something else we need to do on windows to get nspr logging?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #8)
> odd, I have the same code and I see your problem.  I see nspr logs for
> linux, but not for windows.  Is there something else we need to do on
> windows to get nspr logging?

I think the upload dir may be missing.  I can test with linux too (will let you know).  It's just that the windows regression is so strong it could be easier to find the cause.

Thanks.
I see moz_upload_dir in the environment, I suspect it is working as you think, do we need any additional environment variables for firefox and windows?
(In reply to Joel Maher (:jmaher) from comment #10)
> I see moz_upload_dir in the environment, I suspect it is working as you
> think, do we need any additional environment variables for firefox and
> windows?

Definitely not, I do this every day... weird.  Also, the timing is identical for cache2 on and off so I don't think it's pulling from my repo somehow...
I looked at the talos.json modifications and they look right.  But that is a good point you make about the timings.
First conclusion:
regarding the svg_opacity test, it's clear from looking at timing between OnStartRequest and OnStopRequest - the new cache delivers data extremely slowly via 16kB chunks.  So, it's dupe of bug 913819 and bug 988011.
(In reply to Honza Bambas (:mayhemer) from comment #13)
> First conclusion:
> regarding the svg_opacity test, it's clear from looking at timing between
> OnStartRequest and OnStopRequest - the new cache delivers data extremely
> slowly via 16kB chunks.  So, it's dupe of bug 913819 and bug 988011.

No, the conclusion for svg (only) is to enlarge file chunk size: bug 988318.

I will rerun all talos tests and compare after all now pending changes get to the tree.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Latest run (current gum):

http://compare-talos.mattn.ca/?oldRevs=4d3b22782bf1&newRev=1f8386ff9889&server=graphs.mozilla.org&submit=true

tsvgx 
- average: tiny regressions/improvements, *gearflowers.svg: significant regression on some platforms* since the test file doesn't fit our chunk size (256kB)

tsvgr_opacity 
- matching, the win8 regression is there because one run took 462ms (it's a known test jittery)

tp4m 
- large improvements, but also some (less significant) regressions

tp5o_paint 
- *average: slight (not red) regressions*, *myspace.com: significant regression*, some small regressions on other sites too, but also improvements

tp5o_main_rss_paint
- matching (fixed from the last time)


Other tests are matching or improving.  tresize, ts_paint up to more then 8%!


To sum: chunk preloading may still be needed, I'll try to look how complicated would be to save some thread loops when loading metadata what is IMO (one of) cause(s) of the small regressions of tp5o_paint.
Update: http://compare-talos.mattn.ca/?oldRevs=4d3b22782bf1&newRev=65c3ed06cefb&server=graphs.mozilla.org&submit=true

very simple experimental patch [1] that saves some main-thread loops when opening existing cache entries (I'll file a bug and clean in up).

tp5o_paint 
In most cases - win!

Unfortunatelly doesn't help with the biggest tsvgx regression, the chunk preload is still needed.

[1] https://hg.mozilla.org/try/rev/1c8c7ba55f3e
Last regressive tests is tsvgx/*gearflowers.svg on And 4.0.  Bug 913819 helps here, but not enough.  Comparing log and timings cache1 vs cache2+preload:

https://tbpl.mozilla.org/?tree=Try&rev=59959b5563af (cache1)
https://tbpl.mozilla.org/?tree=Try&rev=0f8228930cfe (cache2+preload)
(In reply to Honza Bambas (:mayhemer) from comment #17)

It's cursed :)

https://tbpl.mozilla.org/?tree=Try&rev=0f93d6b96878 (cache1)
https://tbpl.mozilla.org/?tree=Try&rev=02a1b46e5dc0 (cache2+preload)
(In reply to Honza Bambas (:mayhemer) from comment #18)
> (In reply to Honza Bambas (:mayhemer) from comment #17)
> 
> It's cursed :)
> 
> https://tbpl.mozilla.org/?tree=Try&rev=0f93d6b96878 (cache1)
> https://tbpl.mozilla.org/?tree=Try&rev=02a1b46e5dc0 (cache2+preload)

Log is incomplete...  I'm missing the logcat file here.  Probably is not created for talos runs, but only for test runs like the jsreftests.... :(
I looked through the push with bug 1005475 and don't see any long standing regression- things seem to be looking much better.
Comment 23 shows that only gearflowers.svg regressed meaningfully - ~500ms or 26%.

It's probably safe to assume (do correct me if I'm wrong) that the extra 500ms at the result is only due to cache v2.

As much as we'd want it to, talos doesn't always show the whole picture, and sometimes also distorts the picture. In order to interpret the numbers, we should first understand what tsvgx does and what do the numbers represent:

- For the non-hixie tests: Takes a timestamp ("start"), then changes the URL to point to the file (an http://localhost/... url), then measures the time until the "mozAfterPaint" event ("end") which arrives after the "load" event. In essence it measures the load time of the page - aimed mostly at measuring parsing/rendering performance, but in practice also includes network overhead (to localhost), chrome rendering due to the location change, etc.

- The hixie.* tests: have code inside the pages themselves which takes a timestamp ("start") at the beginning of the file, then does some svg animation (fixed number of iterations per test which takes about 500-1000ms to complete), then takes another timestamp ("end") then calculates and reports the result (end - start). Depending when the first timestamp is executed/taken, it may or may not include the network overhead, and most probably doesn't include chrome rendering overhead.

So this means that even if the hixie pages load times are affected by network performance, the reported results won't necessarily be affected by them, because these pages measure their own "start" timestamp.

Of the non-hixie tests, gearflowers.svg is by far the largest file - 520k compared to 2-3k for the rest of the files (yes, ~200x bigger file). Of all the pages, the second largest is hixie007.svg which is 390k, and the rest are 100k or smaller.

Now that we know how the numbers are taken for each page, we should also take into account the whole tsvgx run and the numbers which compare-talos shows.

The full tsvgx run (row major) consists of loading and measuring all the svgx pages one after the other (as described above), and then repeating this sequence 25 times in a row without closing the browser.

All the numbers are then reported to both graphserver and datazilla. For now, Compare-talos takes its results from graphserver (this will change with bug 696196).

Graphserver (and consequently compare-talos for now) ignores the "worst" results from each such 25x run. I don't recall if worst per page's 25 results, or the entire run with the worst average of all the pages. Regardless, "worst" mostly turns out to be "first", and "worst" one of 25 is not represented at whatever compare-talos shows.

Datazilla doesn't ignore any of the numbers AFAIK.

So, guessing that "worst" is "first" here, it could mean that:

1. We don't know what's the performance difference on the first load of each page (so cache v2 could be either worse or better than the old cache).

2. That on the following 24 loads, this 520k file consistently takes extra 500ms on average more than it takes to load with the old cache.

So, with lots of assumptions about the cases where v2 performs worse than v1, the first thing you should ask yourself is if you can explain the difference which compare-talos shows (to yourself more than to us).

If you do, the next question is: do you find it acceptable? There could be many reasons for accepting such "penalty". E.g. it's rare, users are not likely to notice it, much more maintainable code, much faster in other places which aren't represented in talos results, etc. You can probably answer this question better than most others.

And from the answers to those, derive your next action.
Right now the way our system works we won't be seeing talos regression alerts for svgx on android.  Avi has outlined exactly what is happening here.  Sometimes on Android we see big differences which we don't see on desktop.  Also on android we load data over a real ip address and on desktop we do it via localhost (not sure if that matters for this code path).
Flags: needinfo?(jmaher)
(In reply to Avi Halachmi (:avih) from comment #24)
> Graphserver (and consequently compare-talos for now) ignores the "worst"
> results from each such 25x run. I don't recall if worst per page's 25
> results, or the entire run with the worst average of all the pages.
> Regardless, "worst" mostly turns out to be "first", and "worst" one of 25 is
> not represented at whatever compare-talos shows.
> 
> Datazilla doesn't ignore any of the numbers AFAIK.

I rechecked the facts, and here's a clearer picture of how we end up with the numbers:

datazilla - doesn't display or calculate overall average of all the sub-tests. It only displays the data per sub-test, where the top section is a graph where each data point is a median of 25 runs of the page, and if it's hovered, the bottom part shows all the individual 25 values for that point.

Graph server numbers go through different filters:
1. Talos first ignores the first 1-5 results per sub-test (of 25), then does median of the rest, then sends all these medians to graphserver (one median per sub test).
2. Graphserver takes all these medians, and ignores the highest of them (per data point, but usually it's the same sub-test which gets dropped).

Compare-talos (with graphserver) shows the overall change % on the main page which is the same value which graphserver shows (sub-tests average after dropping the highest median), but on the details page, compare talos shows all the sub-tests, including the one which got dropped for the overall % change.

The bottom line didn't change much - the first loading of a specific page in a talos run (single browser session) is ignored, either explicitly with graphserver or implicitly due to the median otherwise.

The assumptions and conclusions from comment 24 also remain valid.
(In reply to Avi Halachmi (:avih) from comment #24)
> So, with lots of assumptions about the cases where v2 performs worse than
> v1, the first thing you should ask yourself is if you can explain the
> difference which compare-talos shows (to yourself more than to us).

I was trying to find out by getting NSPR logs.  It failed for me even with a direct hack of prlog.c as once advised from you.  The NSPR logs were pretty incomplete in the resulting test log eventually.

Is there anything other I need to do to get logs from Panda?


Few cache2 impl details: we load and deliver data from disk to OnDataAvailable by 256kb chunks - the old cache delivers in larger chunks sometimes during just a single ODA call, tho.  In the latest m-c code we do a preload of 4 chunks (i.e. 1MB of data) ASAP we have opened (and roughly verified) the cache file.  This happens sooner then the actual read of the data is first time requested, so the 3 chunks needed here (for gearflower) to deliver the svg file should already be available in memory and delivered super-quickly to OnDataAvailable when requested.  Still, there are 3 calls to ODA, each with max of 256kB of data.

Here is a try push https://tbpl.mozilla.org/?tree=Try&rev=dd8ee9608c96 that changes the chunk size to 1MB to see if the granular delivery is the problem here.

I'm out of ideas for more short-term optimization in cache2 by now.
Another question is: would you be willing to accept this one regression for now with that we will fix it after the new cache is enabled for real?
as for accepting a regression now, it is a trade off if there is a bunch of goodness coming from the feature vs a single regression on a single platform.  If we track this and have a plan in place, I always advocate moving forward.  The danger becomes if the fix gets deprioritized and we start uplifting.

getting logs off of panda:
* do we know where they are going exactly?
** I suspect we are defining this via env vars and the reality is this is going to a directory which doesn't exist on panda boards.  I recommend /mnt/sdcard/tests/nspr.log
* we need to modify ffprocess_remote.py:
http://hg.mozilla.org/build/talos/file/b4907f0b27d3/talos/ffprocess_remote.py#l222 <- code which runs on android
* add in call to getfile:
http://hg.mozilla.org/build/talos/file/b4907f0b27d3/talos/ffprocess_remote.py#l117 <- at the end of runProgram, copy it to MUD like we do in ttest for other nspr.log files
(In reply to Honza Bambas (:mayhemer) from comment #27)
> I was trying to find out by getting NSPR logs.  It failed for me even with a
> direct hack of prlog.c as once advised from you.  The NSPR logs were pretty
> incomplete in the resulting test log eventually.

Sounds like we need to solve the logs issue before anything else. Otherwise, we're blind: can't reproduce locally, can't get logs, and can't explain the regression.

Come over to #perf on irc and let's fix this.

As for why it happens, based on a (pretty solid) assumption that the regression numbers are NOT on the first load of this page (because the first load numbers are ignored), these are the first things which come to my mind:

- v2 doesn't cache this 522k file, while v1 does cache it.
- Some/all of this file's data gets dropped from the cache on v2, but doesn't on v1.
- Swapping with v2, but not with v1.

Are the cache sizes the same between v1 and v2?
(In reply to Honza Bambas (:mayhemer) from comment #27)
> Here is a try push https://tbpl.mozilla.org/?tree=Try&rev=dd8ee9608c96 that
> changes the chunk size to 1MB to see if the granular delivery is the problem
> here.

OK, this is it, regression is gone with 1MB of chunk size: http://compare-talos.mattn.ca/?oldRevs=5c379a4d7577&newRev=dd8ee9608c96&server=graphs.mozilla.org&submit=true

Subsection      	Original 	New 	        % Change
gearflowers.svg 	1782.92 	1835.08 	2.93
Depends on: 1010221
No longer depends on: 1010221
Latest results

http://compare-talos.mattn.ca/?oldRevs=8ea0aaea5999&newRev=e162b19113a6&server=graphs.mozilla.org&submit=true

with bug 1010221 applied.

I'm closing this bug for now, since there are no major regressions any more.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.