Closed Bug 455508 Opened 16 years ago Closed 15 years ago

20% Tp regression on linux, September 5

Categories

(Core :: Graphics: ImageLib, defect, P5)

x86
Linux
defect

Tracking

()

VERIFIED WONTFIX
mozilla1.9.1

People

(Reporter: johnath, Assigned: joe)

References

Details

It seems to only show up on the fast linux box, but it's also persisted since.

http://graphs.mozilla.org/graph.html#show=911694&sel=1219792960,1221564879

Smaug's been looking into possible candidates, including possibly:

http://hg.mozilla.org/mozilla-central/rev/44853b2a5fc2 - joe's changes to imglib caching.
This graph link shows that it's not a problem on the "real" Tp boxes

http://graphs.mozilla.org/graph.html#show=911694,395125,395135,395166&sel=1219792960,1221564879

So if we're happy with that discordance, we could throw this bug away - but it feels like our fast box should track better than that.
There was also cairo update bug 453765.
The graph shows a short spike and then dropping back to the old level and again 
the regression. Maybe backing out bug 453765 caused that temporary drop?
Possible -- the first time the cairo update landed it was causing crashes on linux, though.  I don't know what would show up as a 20% hit here that doesn't show up at all on the full pageset, though.
Note that the new imagelib cache is a good bit smaller than the old one on machines with enough RAM.  That could impact the shorter test (which might have stayed entirely inside the bigger cache) more than the longer test (which thrashed even the bigger cache).
Hm, Joe's patch was checked in and backed out (due to memory leaks) in that timeframe (8am or so pacific).
I'm pretty sure that Boris' assumption is correct. I wasn't terribly concerned because the small Tp pageset is not really representative of real browsing.
(In reply to comment #6)
> I'm pretty sure that Boris' assumption is correct. I wasn't terribly concerned
> because the small Tp pageset is not really representative of real browsing.

Well.

I think there's not a lot of point to having fast boxes running Tp if we're comfortable ignoring their results.  So if we concede that the change is worth taking, then the question I have is how we make this test more useful to us in making decisions.  I don't think resolving this bug as "meh" is a happy situation.
I don't think we want to _ignore_ the results, just take it as a natural byproduct of saving ~8 MB of working set size.

http://graphs.mozilla.org/#show=912154,911659,911698

Naturally, just accepting is what I'm advocating. I'm willing to listen to others' arguments, though.
I think there's a difference between "meh" and "we understand exactly why this test shows a regression and why the other test doesn't and we feel that this regression is not very relevant to real usage and that the benefits from the change outweigh what drawbacks there are".

That said, I would be interested in bumping up the image cache size as a test just to make sure that really is the issue here.
To joe to perform said experiment once the tree reopens.
Assignee: nobody → joe
Flags: blocking-firefox3.1?
Flags: blocking-firefox3.1? → blocking-firefox3.1+
OS: Mac OS X → Linux
I'll do that experiment after the tree reopens after 1.9.1 beta 2.
Joe, you plan to do the experiment sometime next week (since we are reopneing then)?  I'm just trying to make sure this doesn't fall of the radar...
Yeah, it's in the back of my mind. Waiting, waiting, waiting!
Pushed a "double the image cache size" patch to a closed tree in http://hg.mozilla.org/mozilla-central/rev/17d1b1f439ea . We'll see what happens.
And backed out again in http://hg.mozilla.org/mozilla-central/rev/84bfba9454e2 - looking for data now.
With 2 of 3 fast talos boxes reporting, we've only got about a 1.5-2ms win on both Linux and Mac OS X. So clearly it's not being caused solely by memory, although that's a factor.

In an attempt to explain this, I discovered that bug 430061 (my imagelib cache patch) was backed out at 09:37 on that day, and not re-landed until 21:34. So, upon further reflection, I don't think that this Tp regression was actually caused by me. (Checkins on that day: http://tinyurl.com/77upcf)

I'm not sure where that leaves us...
Assignee: joe → nobody
Turns out I can't read a digital clock. You whippersnappers and your new inventions!

I still don't know where we're left. This regression is not caused solely by memory, and it doesn't show up on "regular" Tp boxes.
Assignee: nobody → joe
Joe, did doubling the cache size bring it back to the size it used to have?

Note that for this sort of test (unlike real-world browsing) all that matters is whether the cache is big enough to contain the whole test.  If it's smaller than that, then the first page is evicted by the time you get back to it.  If the test takes up 12 megs, then going from 4 to 11 will have no effect, while going from 11 to 13 will help a lot...
The size did seem to increase by roughly the delta of the cache size, so it's possible that it wasn't large enough to hold the whole working set.
Well, what happens if you set the new size to the memory cache size on those machines?  On the Linux boxes, that should be whatever size corresponds to 512MB RAM, as I recall.
Component: General → ImageLib
Flags: blocking-firefox3.1+
Product: Firefox → Core
QA Contact: general → imagelib
Target Milestone: --- → mozilla1.9.1
Flags: blocking1.9.1+
As far as I can tell, the memory cache size for 512 MB machines, as those machines apparently are, is 14 MB. http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsCacheService.cpp#547

One other thing to note is that the image cache forcibly expires elements every 10 seconds using nsExpirationTracker, and the image cache doesn't do that.

Current plan: Up it to 15 MB and give it a shot again, but with the significant caveat that we do that timed expiration.
You mean the memory cache doesn't expire things and the new image cache does?

The Tp3 test almost certainly takes more than 10 seconds to do a single cycle...

It's still worth trying with 15MB, of course.
Well, I checked in two changesets on 17 Jan: http://hg.mozilla.org/mozilla-central/rev/56f630c3bdf6, which upped the image cache to 15 MB, and http://hg.mozilla.org/mozilla-central/rev/bc77b82becfc, which removed the time-based expiry of image cache entries. This resulted in about 5 ms win on Tp, and about 10 MB increase in Tp_RSS. (Due to Talos screwedness, only the second patch got tested, but it didn't win us anything.)

These patches also made very little effect on the regular Talos machines - about 1 ms increase in Tp, and almost no difference in Tp_RSS.
That doesn't sound like a 20% improvement, right?
I really wish try server actually gave out sane performance numbers and that we could test particular changesets on it.  :(  As things stand, once we take too long to address a performance regression so much other stuff changes that it becomes hard to even tell whether we're testing backouts correctly.  :(
(In reply to comment #25)
> I really wish try server ... and that we
> could test particular changesets on it.

ssh://hg.mozilla.org/try/ provides testing for particular revisions, but the problem here is that the "fast" pageset is not tested by try server.

The qm-plinux-fast03 logs from around Sept 5 would be helpful, but the links from the tinderbox waterfall around that date link to mostly empty pages.  Is that because the logs have been deleted or is it possibly just the result of something being renamed?
Tinderbox log are deleted after sufficient time has passed.  If one is planning to not take action for a while on a bug which relies on such logs, they need to get attached to the bug.
One last try: http://hg.mozilla.org/mozilla-central/rev/84c6254be42a turned off time-based expiry and increased the cache size to 50 MB. 

This decreased fast-Tp from 250 ms to ~200 ms on the fast Talos box, and didn't decrease normal Tp by more than 1 or 2 ms. This tells me that it's simply a problem of cache size, and since we haven't affected normal Tp, which we consider more indicative of real browsing, I'm going to say this isn't really a problem.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
Verified.  We should get up a "real browsing" Tp test, though, where we load not just unrelated front pages but various pages off a single site in a row for a number of sites.  I think that would be a better reflection of real browsing as far as cache is concerned...

Want to file a bug on that?
Status: RESOLVED → VERIFIED
Bug 474338 filed.
Flags: blocking1.9.1+ → blocking1.9.1-
You need to log in before you can comment on or make changes to this bug.