Closed Bug 486236 Opened 11 years ago Closed 10 years ago

2009-03-07: 130ms Ts regression on OSX

Categories

(Firefox :: Session Restore, defect)

3.5 Branch
x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: beltzner, Assigned: zeniko)

References

()

Details

(Keywords: fixed1.9.1)

Attachments

(1 file)

Graph:

http://graphs-new.mozilla.org/#tests=[{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2236%22}]&sel=1236381876,1236465509

Checkins:

http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?startdate=March+6+2009&enddate=March+7+2009

Suspect bugs:
bug 449198 - changes to HTTP observers on cache responses
bug 476928 - changes to nsISessionStore
bug 474917 - changes to browser.js

bug 462973 is also in that range, but Dietrich did his due diligence there.

(cc'ing bug committers, owners and reviewers)
Flags: blocking-firefox3.5+
Johnath, can I get you to try and work this one through, since Dietrich got the last one?
So a branch-only regression without a corresponding trunk regression?

The sessionstore change doesn't look like it would affect Ts much.  The browser.js change also doesn't look like it would.

The HTTP change might, if something is observing that notification and being slow on it, or if something else is severely broken (e.g. the observer service).  But doesn't seem likely.

I suppose we could back things out one by one and see what happens....
I would suspect the session store changes since it was changed on trunk (in a way that *might* affect Ts). The check alone may be faster (although I doubt that much) but if the Array.slice is called, maybe that's the cause?

The rest are either the same on trunk and branch or not changed in any way that really matters. I would suggest taking the second patch in bug 474917 as a test...

Just my .02 (R)
...oh and there is only one observer for the new notification in bug 449198, and it's a test. At least if my mxr search is right:

http://mxr.mozilla.org/mozilla-central/search?find=%2F&string=http-on-examine-cached-response
Sorry to spam so much, but if I read the graph correctly a more accurate pushlog would be:

http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?startdate=March+5+2009+00%3A00%3A00&enddate=March+6+2009+10%3A00%3A00

That's March 5 23:00 through March 6 09:00, the perf level went up between 06:00 and 09:00 on March 6, but the testing may have been behind (can't be ahead though :)

Which points to bug 480180 or bug 481528.
(In reply to comment #5)
Scratch that I'm wrong, it was on March 7th. Previous pushlog was correct.
(In reply to comment #2)
> So a branch-only regression without a corresponding trunk regression?

Not necessarily; the trunk numbers are more variant than 1.9.1, and the baselines seem to converge, which makes me think that the regression was already on trunk and just went unheeded?

Also, the Leopard regression shows up a day before the Tiger regression which makes me want to cry.

> I suppose we could back things out one by one and see what happens....

I think that's what we're going to have to do.
So first of all, pro-tip: pushloghtml's notion of "startdate" and "enddate" are based on the current time of day, so depending on time of day, the change history you see is excitingly divergent.

With that said, I went back to the tinderbox waterfall from that date, to see when the numbers spiked, and pulling out changeset IDs. By intersecting a couple of different boxes, the list is down to 3:

http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?fromchange=e40d9b4121f4&tochange=6f44d1dd4764

6f44d1dd4764	Simon Bünzli — Bug 476928 - Slow closing of tabs when entering Private Browsing mode. r=dietrich
0e90aaff293f	Simon Bünzli — Bug 462973 - 50ms increase in Ts 2008-11-02 am. r=dietrich
f162aa6d3ad9	Nochum Sossonko — Bug 474917 - Update status bar when dragging in/out of New Tab button. r=gavin

This, incidentally, matches exactly the range for the linux Ts improvement (minus NPOFB change e40d9b4121f4).  Windows is flat afaict and while there is a nearby Mac 9.2.2 spike a day earlier, I don't think we should consider that part of this bug.  If there isn't something already on file for the 9.2.2 bump, I'll file one.

Now, as Dao points out, in bug 462973 Dietrich tried a backout of two of the suspects (Bug 476928 and Bug 462973 - http://hg.mozilla.org/releases/mozilla-1.9.1/rev/dd6c7a41575d ) with no perceivable effect.

However, looking at the changeset and talking to Dietrich, I've come to learn that Bug 476928 was not actually part of the backout test. Given that, I think the next smartest thing to do in tracking this down is to back that changeset out and re-open that bug.
Does the restoreWindow method from bug 476928 even run for the Ts test? I would assume that it doesn't.
(In reply to comment #10)
> Does the restoreWindow method from bug 476928 even run for the Ts test? I would
> assume that it doesn't.

The changes in http://hg.mozilla.org/releases/mozilla-1.9.1/rev/f162aa6d3ad9 don't seem like they should be a 10% Ts hit either. I'm happy to have them both backed out.

I don't want to delay the fixes for these bugs without reason, but I also don't want to spend more time guessing at where the surprising regression is hiding. If we back out and see a fix, great.  If we backout and see no change, my only other plausible lead is bug 478603, but that was a day earlier (and more likely to have been the cause of the surprising 10.5 regression than the one we're discussing here.)
Well, there's still bug 462973, which clearly touches start-up code. Given its history, I wouldn't exclude it.
As discussed in IRC, going to back out each of these, sequentially. Bug 462973 is the last one I want to try (if at all) because while Dao's right that it has a bizarre history, it's also been backed out once before to no effect, and I have no reason to expect different results this time.
(In reply to comment #13)
> As discussed in IRC, going to back out each of these, sequentially. Bug 462973
> is the last one I want to try (if at all) because while Dao's right that it has
> a bizarre history, it's also been backed out once before to no effect, and I
> have no reason to expect different results this time.

First backout:

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/ce9c2499760d - Backout bug 476928 (changeset 6f44d1dd4764) as part of trying to track down Mac Ts regression.  Tracking in bug 486236
That backout didn't seem to change things on 8.8.2 fast talos after the first run, and those runs were pretty consistent at detecting the regression.

http://graphs-new.mozilla.org/graph.html#tests=[{%22test%22:16,%22branch%22:3,%22machine%22:49}]
Second backout:

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/98e52788513a - Partial backout of bug 474917 (strings stay put) as part of tracking down Ts regression on OSX - tracked by bug 486236
That backout didn't seem to change things either:

http://graphs-new.mozilla.org/graph.html#tests=[{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2249%22}]

The next step is either to back out dietrich again while hoping for different results, or follow up on Alice's suggestion to reverse the Ts changes made a day earlier in bug 478603.  Neither of these options seems particularly plausible, but Alice's changes are a thing that should happen anyhow, to see if they remedy bug 478603's regression, so that feels like the obvious next step.

Alice, do you agree?
Backing out the changes in bug 478603 does risk increasing oranges on vista talos boxes - so we'd have to live with that.  I'd be willing to do a back out just as a double check that it really can't be the cause of this.
(In reply to comment #18)
> Backing out the changes in bug 478603 does risk increasing oranges on vista
> talos boxes - so we'd have to live with that.  I'd be willing to do a back out
> just as a double check that it really can't be the cause of this.

Yeah, I think long-term we shouldn't leave it backed out - we want non-orange talos boxes.  But short term, it would be good to know if it is a cause of this bug, or bug 478603.
Alice reversed this change last night at 5pm PDT.  Looking at some of the boxes, it seems to have partially reversed the regression, but there's still an unexplained chunk:

http://graphs-new.mozilla.org/graph.html#tests=[{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2236%22},{%22test%22:%2216%22,%22branch%22:%223%22,%22machine%22:%2238%22}]

Before the regression we were around 1200ms on this test/platform/branch, after the regression we were about 1325, and after the talos backout we seem to be landing around 1275.

Alice - were there other changes pushed around the same time as the original Ts fix?  Talos numbers are pretty reliable, not trying to put this at your feet or anything, but all three of the patches in the window have been backed out at one point or another trying to track this, and so far the talos Ts backout is the only thing that's moved the needle.
That NPOTB label seems accurate.
(In reply to comment #21)
> I think http://hg.mozilla.org/releases/mozilla-1.9.1/rev/e40d9b4121f4 is also
> part of the regression range (although it's marked as NPOTB):
> 
> http://hg.mozilla.org/releases/mozilla-1.9.1/pushloghtml?fromchange=89df940af254&tochange=e212796b220f
> 
> from the source stamps of the relevant change on
> http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox3.5&maxdate=1236445828&hours=24&legend=0&norules=1

I'm not sure it is.  If I use your tinderbox link and look at the fast-talos box (MacOSX Darwin 8.8.1 talos mozilla-1.9.1 fast qm-pmac-fast04) I see multiple low-Ts runs on e40d9b4121f4 before the regression. Since that was the last low-Ts score, you'll note that my window in comment 9 uses e40d9b4121f4 as the "fromchange"
I was looking at qm-pmac-talos01.
Looking over bonsai, the only Talos changes in the time range where the changes to the timeouts - which has already been backed out.  I don't see anything in the time range on the talos buildbot side of the system.
I notice in bug 478603 comment 18 Nick talks about bug 480577 landing around the same time - but I don't know how to check when it went live (or, indeed, why it should hurt Ts in any way.  Then again, I'm not clear on why changing the timeout should have...)

Is there any conceivable way it might be implicated?

Failing that, is it possible that the act of pushing updated talos code caused some kind of elevation on those boxes?  IIRC, we've had situations in the past where talos results were a bit off until the boxes were rebooted, but I also seem to remember them getting more regular reboots now?
Talos boxes are rebooting after every test cycle.  

The crash reporting work doesn't seem to line up with the regression range, so I don't quite see how it could affected Ts - especially if it is only a change observed on a single platform.
Okay - so I don't see it on other branches for this platform+test.  I don't see it for Txul on this platform+branch. And I mostly don't see it on other platforms for this branch+test (linux shows a mild improvement).

Each of the patches in the window has been backed out (2 by me here, one by Dietrich in bug 462973) without effect. Backing out the talos Ts fix seems to have brought the numbers down, but not all the way down, and alice has no other changes in the window there.

I'm out of guesses on the remainder of this regression.  Dietrich, I assume your backout was a complete one?

Who's got a hat with a rabbit in it?
Could subsequent talos changes have been masked by the initial change?
(In reply to comment #31)
> I just ran
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2009-03-05-02-mozilla-1.9.1/firefox-3.1b4pre.en-US.mac.dmg

Just to clarify - this is a build produced on the mozilla-1.9.1 branch prior to the Ts regression.
Johnathan reverse patched http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0e90aaff293f against mozilla-1.9.1 on a try server build, and then I put it through a talos run.

http://graphs-stage2.mozilla.org/graph.html#tests=[{%22test%22:%2212%22,%22branch%22:%223%22,%22machine%22:%22113%22}]&sel=1239015358,1239231120

The most recent drop, with a value of 1167, is the result of that build.

(The other drops on the 6th and 7th are from me testing older builds to try and narrow the regression range)
Well that just puts us deep into crazy-town, doesn't it?

Summary (since Chris' last several comments might seem cryptic to someone who wasn't also present in our side conversations):

1) We backed out the Talos change and got back half of our loss.
2) We thought maybe there was "residual Talos bustage" of some kind accounting for the other half, so we took the current still-elevated talos and ran it against old builds from before the regression.
3) Those results came all the way back down.  Talos is working fine.

So if Talos is working fine, but we're still seeing the regression, that means it's a code regression, right? Except we backed out all three changesets in the window at one point or another (2 by me in this bug, Dietrich's before in bug 462973).

The only thing we could think of was that Dietrich's backout of 462973 happened while talos was testing high from its own changes, and so maybe that somehow masked things. When he backed it out, saw no effect, and re-landed it, maybe there was some mystical interaction with the Talos Ts changes that concealed the effect.

BUT we weren't going to make dietrich back out again, because it was tree churn on a seriously wishy washy hypothesis.

SO:

4) I ran a try server build backing out just the bug 462973 changeset ( http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0e90aaff293f ).  Dietrich's backout included multiple code changes, but this one was just the little 10-liner. 
5) Catlee pointed one of his talos army (up to date Talos, without the Ts-regressing change) at that build.
6) See comment 33 - with that backout, the numbers returned to pre-regression values.

I think changeset http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0e90aaff293f is the rest of the regression, and should probably be backed out (again).

Dietrich, Zeniko - what do you think?

(And yes, Dao gets an "I told you so" for comment 12.  :)
Chris: turns out that the patch Johnath reversed was only part of the session restore changes, and was the part that specifically fixed the Ts hit from the original change :)

I've submitted a full backout to the tryserver. When complete, can you run that through your Talos slave pool? I'll post a link here when the builds are available.
catlee is going to run these builds through the slave pool
Dietrich's build scores 1322.84 on Ts.
(In reply to comment #38)
> Dietrich's build scores 1322.84 on Ts.

sadly, that doesn't really tell us much. is this on the same box you used in the previous comments? eg:

http://graphs-stage2.mozilla.org/graph.html#tests=[{%22test%22:%2212%22,%22branch%22:%223%22,%22machine%22:%22113%22}]&sel=1239186618,1239743880

if so, these numbers are jumping around, with peaks much higher than the first run with my changes. maybe we should leave that build on there to see if a pattern emerges?
In what is now my patented tradition, a summary:

1) Backing out the talos change fixed half the regression
2) Backing out just the Ts-regressing part of dietrich's patch fixed the rest (but that is known, which is why it was buddied with a Ts-mitigation fix)
3) Backing out the Ts-regressing part of dietrich's patch AND the Ts-mitigation fix didn't move the needle.

If step 3 *had* moved the needle, we could say that dietrich's mitigation was insufficient, because backing out his change still made things better. We would back him out for real, and tell him to reland when he understood what was happening.  But that's not what happened.

Since backing out step 3 had no effect, we conclude that Dietrich's mitigation DOES work, that the code with both patches tests the same as the code with neither patch.

The win we thought we had in step 2 was pretend - sure we can back out half the patch to get our win back, but that's not really "fair" since that's only half a patch, and presumably we want the change it makes.

Are we back to wtf now?
For posterity.

<dietrich> in comment #41, you said: "Backing out just the Ts-regressing part of dietrich's patch fixed the rest..."
<dietrich> however
<dietrich> in comment #35, i pointed out that you didn't back out the Ts-regressing part of the ss patches
<dietrich> you backed out the Ts-fixing part
<dietrich> which makes the current state of wtf doubly so
<dietrich> resolved->diminishingreturns
<johnath> dietrich: wow, that's doubly... upsetting
<beltzner> dietrich: lol
<johnath> dietrich: is the Ts-fixing part otherwise, logically, something we want?  Because I'll be happy with a tree that has the SS fix, but not the Ts-regressing Ts-fix.  :)
<dietrich> johnath: iirc, the Ts mitigation part only regressed old-mac
<dietrich> and fixed it on all other platforms
<dietrich> but i'd need to consult the record to confirm that vague memory
<johnath> dietrich: lovely
So - catlee - you could run my experimental backout build against other platforms and see if, indeed, it hurts Ts on XP or 10.5.

If it does hurt other platforms, then we either need to understand why 10.4 is a unique snowflake, or we need to, as dietrich says, write it off as diminishing returns.

If it does not hurt other platforms, then I think we should make my experimental backout permanent, and remove the "fix" that hurts 10.4, and close this bug.

And have a beer.
Okay.

Backing out the Ts-fix makes Tiger about 10% better and Leopard... "some" amount better.  It makes XP a lot worse, linux a little worse, and the jury's out on Vista.

So. We know the patch at work, we know that it has platform specific effects.

IMO, this is now a standard understand-and-evaluate bug in places: what is it about Mac that is sucking with this otherwise-nice fix, can we fix the mac half of it, or are we just going to say "winning on windows+linux is worth losing on (mostly 10.4) mac"?  

Places folks - what's the word?
Component: General → Places
QA Contact: general → places
Wait - I thought this was session restore stuff?  What exactly is the code change that we are debating here?  It's been hard to follow along...
(In reply to comment #46)
> Wait - I thought this was session restore stuff?  What exactly is the code
> change that we are debating here?  It's been hard to follow along...

Yes, sorry, session restore.  This is the specific changeset in which we're all so keenly interested:

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/0e90aaff293f
Component: Places → Session Restore
QA Contact: places → session.restore
The root change is bug 395488. It is not a blocker. We do not tolerate performance regressions.

Bug 395488 should be backed out until it can be resolved without regressing startup on any platform.
(In reply to comment #48)
> The root change is bug 395488.

Note that bug 395488 is needed for the current crash behavior where we only show about:sessionrestore after a repeated crash. Without bug 395488, we might not be able to count crashes.

Wouldn't the more appropriate solution be to #ifdef the changes from bug 462973 to not apply for Mac?
(In reply to comment #49)
> (In reply to comment #48)
> > The root change is bug 395488.
> 
> Note that bug 395488 is needed for the current crash behavior where we only
> show about:sessionrestore after a repeated crash. Without bug 395488, we might
> not be able to count crashes.
> 
> Wouldn't the more appropriate solution be to #ifdef the changes from bug 462973
> to not apply for Mac?

That sounds fine, can you whip up a patch for that?
Attached patch fixSplinter Review
Assignee: nobody → zeniko
Attachment #373399 - Flags: review?(dietrich)
Attachment #373399 - Flags: review?(dietrich) → review+
Comment on attachment 373399 [details] [diff] [review]
fix

r=me, thanks. i'll push this to the try-server so we can get testable builds.
Whiteboard: [has patch, in testing]
catlee is going to drop the build back in and let it go for a couple of runs so we can get more context, since these windows numbers conflict with the earlier runs.
pushed to 1.9.1 for a few runs to see if we get clearer numbers

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/119013619753
Dietrich's landing has cooked overnight, and 10.4 numbers are back down a bunch, and 10.5 numbers down a bit, and no one else's numbers are up. I consider this bug resolved.

Thank you to everyone. Go buy yourselves a nice drink, happy Cinco de Mayo!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Keywords: fixed1.9.1
Whiteboard: [has patch, in testing]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is not on trunk yet. I just did the trial balloon landing on 1.9.1. Will check into trunk asap.
Sorry, it was a branch-only regression. Re-closing.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.