455508 - 20% Tp regression on linux, September 5

Reporter

Description

•

17 years ago

It seems to only show up on the fast linux box, but it's also persisted since. http://graphs.mozilla.org/graph.html#show=911694&sel=1219792960,1221564879 Smaug's been looking into possible candidates, including possibly: http://hg.mozilla.org/mozilla-central/rev/44853b2a5fc2 - joe's changes to imglib caching.

Johnathan Nightingale [:johnath]

Reporter

Comment 1

•

17 years ago

This graph link shows that it's not a problem on the "real" Tp boxes http://graphs.mozilla.org/graph.html#show=911694,395125,395135,395166&sel=1219792960,1221564879 So if we're happy with that discordance, we could throw this bug away - but it feels like our fast box should track better than that.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 2

•

17 years ago

There was also cairo update bug 453765. The graph shows a short spike and then dropping back to the old level and again the regression. Maybe backing out bug 453765 caused that temporary drop?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 3

•

17 years ago

Possible -- the first time the cairo update landed it was causing crashes on linux, though. I don't know what would show up as a 20% hit here that doesn't show up at all on the full pageset, though.

Boris Zbarsky [:bzbarsky]

Comment 4

•

17 years ago

Note that the new imagelib cache is a good bit smaller than the old one on machines with enough RAM. That could impact the shorter test (which might have stayed entirely inside the bigger cache) more than the longer test (which thrashed even the bigger cache).

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 5

•

17 years ago

Hm, Joe's patch was checked in and backed out (due to memory leaks) in that timeframe (8am or so pacific).

Joe Drew (not getting mail)

Assignee

Comment 6

•

17 years ago

I'm pretty sure that Boris' assumption is correct. I wasn't terribly concerned because the small Tp pageset is not really representative of real browsing.

Johnathan Nightingale [:johnath]

Reporter

Comment 7

•

17 years ago

(In reply to comment #6) > I'm pretty sure that Boris' assumption is correct. I wasn't terribly concerned > because the small Tp pageset is not really representative of real browsing. Well. I think there's not a lot of point to having fast boxes running Tp if we're comfortable ignoring their results. So if we concede that the change is worth taking, then the question I have is how we make this test more useful to us in making decisions. I don't think resolving this bug as "meh" is a happy situation.

Joe Drew (not getting mail)

Assignee

Comment 8

•

17 years ago

I don't think we want to _ignore_ the results, just take it as a natural byproduct of saving ~8 MB of working set size. http://graphs.mozilla.org/#show=912154,911659,911698 Naturally, just accepting is what I'm advocating. I'm willing to listen to others' arguments, though.

Boris Zbarsky [:bzbarsky]

Comment 9

•

17 years ago

I think there's a difference between "meh" and "we understand exactly why this test shows a regression and why the other test doesn't and we feel that this regression is not very relevant to real usage and that the benefits from the change outweigh what drawbacks there are". That said, I would be interested in bumping up the image cache size as a test just to make sure that really is the issue here.

Boris Zbarsky [:bzbarsky]

Comment 10

•

17 years ago

To joe to perform said experiment once the tree reopens.

Assignee: nobody → joe

Flags: blocking-firefox3.1?

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

17 years ago

Flags: blocking-firefox3.1? → blocking-firefox3.1+

Stefan [:stefanh]

Updated

•

17 years ago

OS: Mac OS X → Linux

Joe Drew (not getting mail)

Assignee

Comment 11

•

17 years ago

I'll do that experiment after the tree reopens after 1.9.1 beta 2.

Shawn Wilsher :sdwilsh

Comment 12

•

17 years ago

Joe, you plan to do the experiment sometime next week (since we are reopneing then)? I'm just trying to make sure this doesn't fall of the radar...

Joe Drew (not getting mail)

Assignee

Comment 13

•

17 years ago

Yeah, it's in the back of my mind. Waiting, waiting, waiting!

Joe Drew (not getting mail)

Assignee

Comment 14

•

17 years ago

Pushed a "double the image cache size" patch to a closed tree in http://hg.mozilla.org/mozilla-central/rev/17d1b1f439ea . We'll see what happens.

Joe Drew (not getting mail)

Assignee

Comment 15

•

17 years ago

And backed out again in http://hg.mozilla.org/mozilla-central/rev/84bfba9454e2 - looking for data now.

Joe Drew (not getting mail)

Assignee

Comment 16

•

17 years ago

With 2 of 3 fast talos boxes reporting, we've only got about a 1.5-2ms win on both Linux and Mac OS X. So clearly it's not being caused solely by memory, although that's a factor. In an attempt to explain this, I discovered that bug 430061 (my imagelib cache patch) was backed out at 09:37 on that day, and not re-landed until 21:34. So, upon further reflection, I don't think that this Tp regression was actually caused by me. (Checkins on that day: http://tinyurl.com/77upcf) I'm not sure where that leaves us...

Assignee: joe → nobody

Joe Drew (not getting mail)

Assignee

Comment 17

•

17 years ago

Turns out I can't read a digital clock. You whippersnappers and your new inventions! I still don't know where we're left. This regression is not caused solely by memory, and it doesn't show up on "regular" Tp boxes.

Assignee: nobody → joe

Boris Zbarsky [:bzbarsky]

Comment 18

•

17 years ago

Joe, did doubling the cache size bring it back to the size it used to have? Note that for this sort of test (unlike real-world browsing) all that matters is whether the cache is big enough to contain the whole test. If it's smaller than that, then the first page is evicted by the time you get back to it. If the test takes up 12 megs, then going from 4 to 11 will have no effect, while going from 11 to 13 will help a lot...

Joe Drew (not getting mail)

Assignee

Comment 19

•

17 years ago

The size did seem to increase by roughly the delta of the cache size, so it's possible that it wasn't large enough to hold the whole working set.

Boris Zbarsky [:bzbarsky]

Comment 20

•

17 years ago

Well, what happens if you set the new size to the memory cache size on those machines? On the Linux boxes, that should be whatever size corresponds to 512MB RAM, as I recall.

Mike Connor [:mconnor]

Updated

•

17 years ago

Component: General → ImageLib

Flags: blocking-firefox3.1+

Product: Firefox → Core

QA Contact: general → imagelib

Target Milestone: --- → mozilla1.9.1

Mike Connor [:mconnor]

Updated

•

17 years ago

Flags: blocking1.9.1+

Joe Drew (not getting mail)

Assignee

Comment 21

•

17 years ago

As far as I can tell, the memory cache size for 512 MB machines, as those machines apparently are, is 14 MB. http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsCacheService.cpp#547 One other thing to note is that the image cache forcibly expires elements every 10 seconds using nsExpirationTracker, and the image cache doesn't do that. Current plan: Up it to 15 MB and give it a shot again, but with the significant caveat that we do that timed expiration.

Boris Zbarsky [:bzbarsky]

Comment 22

•

17 years ago

You mean the memory cache doesn't expire things and the new image cache does? The Tp3 test almost certainly takes more than 10 seconds to do a single cycle... It's still worth trying with 15MB, of course.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

17 years ago

Priority: -- → P5

Joe Drew (not getting mail)

Assignee

Comment 23

•

17 years ago

Well, I checked in two changesets on 17 Jan: http://hg.mozilla.org/mozilla-central/rev/56f630c3bdf6, which upped the image cache to 15 MB, and http://hg.mozilla.org/mozilla-central/rev/bc77b82becfc, which removed the time-based expiry of image cache entries. This resulted in about 5 ms win on Tp, and about 10 MB increase in Tp_RSS. (Due to Talos screwedness, only the second patch got tested, but it didn't win us anything.) These patches also made very little effect on the regular Talos machines - about 1 ms increase in Tp, and almost no difference in Tp_RSS.

Boris Zbarsky [:bzbarsky]

Comment 24

•

17 years ago

That doesn't sound like a 20% improvement, right?

Boris Zbarsky [:bzbarsky]

Comment 25

•

17 years ago

I really wish try server actually gave out sane performance numbers and that we could test particular changesets on it. :( As things stand, once we take too long to address a performance regression so much other stuff changes that it becomes hard to even tell whether we're testing backouts correctly. :(

Karl Tomlinson (:karlt)

Comment 26

•

17 years ago

(In reply to comment #25) > I really wish try server ... and that we > could test particular changesets on it. ssh://hg.mozilla.org/try/ provides testing for particular revisions, but the problem here is that the "fast" pageset is not tested by try server. The qm-plinux-fast03 logs from around Sept 5 would be helpful, but the links from the tinderbox waterfall around that date link to mostly empty pages. Is that because the logs have been deleted or is it possibly just the result of something being renamed?

Boris Zbarsky [:bzbarsky]

Comment 27

•

17 years ago

Tinderbox log are deleted after sufficient time has passed. If one is planning to not take action for a while on a bug which relies on such logs, they need to get attached to the bug.

Joe Drew (not getting mail)

Assignee

Comment 28

•

17 years ago

One last try: http://hg.mozilla.org/mozilla-central/rev/84c6254be42a turned off time-based expiry and increased the cache size to 50 MB. This decreased fast-Tp from 250 ms to ~200 ms on the fast Talos box, and didn't decrease normal Tp by more than 1 or 2 ms. This tells me that it's simply a problem of cache size, and since we haven't affected normal Tp, which we consider more indicative of real browsing, I'm going to say this isn't really a problem.

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → WONTFIX

Boris Zbarsky [:bzbarsky]

Comment 29

•

17 years ago

Verified. We should get up a "real browsing" Tp test, though, where we load not just unrelated front pages but various pages off a single site in a row for a number of sites. I think that would be a better reflection of real browsing as far as cache is concerned... Want to file a bug on that?

Boris Zbarsky [:bzbarsky]

Updated

•

17 years ago

Status: RESOLVED → VERIFIED

Joe Drew (not getting mail)

Assignee

Comment 30

•

17 years ago

Bug 474338 filed.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

17 years ago

Flags: blocking1.9.1+ → blocking1.9.1-