There was a Windows Txul and Tdhtml regression on the afternoon of September 29. The range is rather large; see bug 457885 for why. The last known good changeset is http://hg.mozilla.org/mozilla-central/rev/61642beb4c16 and the first known bad changeset is http://hg.mozilla.org/mozilla-central/rev/38a48d485876, which gives a range of: http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2008-09-29+13%3A00%3A00&enddate=2008-09-29+21%3A00%3A00 For the graphs showing the regression, see: http://graphs.mozilla.org/graph.html#show=395002,395014,395042,912144,1431854 http://graphs.mozilla.org/graph.html#show=787152,787165,787166,1431894 http://graphs.mozilla.org/graph.html#show=395032,395040,395060,912141,1431915 (The Txul regression shows up on Windows XP only; the Tdhtml regression shows on both Windows XP and Vista.) Note that bug 433616 part 2 was already backed out and was not the cause; see bug 457606 for the tracking of that regression (and the relanding of the non-guilty parts).
OK. So looking through the list, the possible bugs that could be causing this are:
Bug 455913
Bug 457050
Bug 373701
Bug 455311
Bug 457047
Bug 457728
Bug 455990
Bug 116649
Bug 455500
Bug 453723
Bug 329534
Bug 457393 (unlikely; it just ifdefs code off on non-Windows platforms)
Bug 457313 landed but got backed out, so it's not a likely cause. ;)
http://graphs.mozilla.org/graph.html#show=912148 may show a Tp3 regression, too, although it's not noticeable on the other machines. (Note that Tp3 on *-fast is apparently a different pageset from Tp3 on non-*-fast.)
I meant bug 114649 above, not bug 116649. OK, current plan is to back out bug 114649 first. If that's not it, we try bug 373701.
Backing out bug 373701 seems to have no effect either. And it looks like giving tryserver a revision id doesn't make it actually build that revision... I guess more manual backouts tomorrow.
Bug 329534 was a test-only change. That means the remaining candidates are:
Bug 455913
Bug 457050
Bug 455311
Bug 457047
Bug 457728
Bug 455990
Bug 455500
Going to back out bug 455990 next.
We forced a new build of http://hg.mozilla.org/mozilla-central/rev/61642beb4c16 and it brought the number back down, so there does seem to be a real code regression. I'm thinking we should probably try http://hg.mozilla.org//mozilla-central/index.cgi/rev/07da123ca0b3 next (it's one of the builds that's known to compile, from bug 457885, and it's around the middle of the range).
I crashed (see bug 458092 comment 6) while typing this comment; it should probably get retyped but I'm too tired to do it tonight. (I crashed while switching tabs to verify if what I was typing on the last line actually made sense given the other changesets tested.)
The last sentence of the above comment (in the image) is potentially correct. I'm going to request a build of http://hg.mozilla.org/mozilla-central/rev/8858457b51ce
http://hg.mozilla.org/mozilla-central/rev/8858457b51ce (2008/10/05 02:30) gave numbers that I think are clearly within the post-regression band (although on the low side of them). And, oddly enough, it looks like relanding Boris's bug 433616 part 2 yesterday improved the numbers a little bit. A theory I've come up with that could explain this behavior is that some aspects of PGO are close to deterministic for a given set of code but quite sensitive to small variations in the code. (I wrote a script to count the "no profile data found" warnings in the log of a given build; they seem constant for multiple builds of the same changeset, but to vary widely between changesets.) I think it's pretty visible on qm-pxp-trunk07 (which is the "most stable" box that I alluded to in comment 11).
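The warning-counting script mentioned above isn't attached to the bug, so the details are unknown; a minimal sketch of the idea (the function name and log path handling are hypothetical, and the matched phrase is assumed to appear literally in the build log) might look like:

```python
import sys

def count_pgo_warnings(log_path):
    """Count 'no profile data found' warnings in a build log.

    A rough proxy for how much of the tree the PGO pass failed
    to find training data for in a given build of a changeset.
    """
    count = 0
    with open(log_path, errors="replace") as log:
        for line in log:
            # Case-insensitive substring match on the warning text.
            if "no profile data found" in line.lower():
                count += 1
    return count

if __name__ == "__main__":
    print(count_pgo_warnings(sys.argv[1]))
```

Running this against logs from repeated builds of the same changeset versus builds of neighboring changesets would show the pattern described above: a constant count per changeset, varying widely between changesets.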
dbaron's comment: Here's a quick summary of the Tdhtml numbers from the fresh builds we did (based on going back and looking at things):
http://hg.mozilla.org/mozilla-central/rev/07da123ca0b3 (2008/10/02 15:18) gave convincingly pre-regression numbers on one of the most stable machines but otherwise seemed more within the post-regression band (though that one machine convinced me that it was pre-regression).
http://hg.mozilla.org/mozilla-central/rev/dd1c08d6d993 (2008/10/02 16:55) gave what look like pre-regression numbers.
http://hg.mozilla.org/mozilla-central/rev/74aad43f37a5 (2008/10/03 07:50) gave numbers that were consistent with pre-regression (although at the high end of the range) but not consistent with post-regression.
http://hg.mozilla.org/mozilla-central/rev/edc314aed893 (2008/10/03 13:21) gave much-improved but not clearly pre-regression numbers.
http://hg.mozilla.org/mozilla-central/rev/c1f6a55626be (2008/10/04 14:00) gave clearly post-regression numbers.
In hindsight, I'm thinking my conclusion that 07da123ca0b3 was pre-regression is somewhat suspect.
Yeah, shaver and I were worried that this might be PGO flutter... :(
http://hg.mozilla.org/mozilla-central/rev/afcc5aa0fb07 (2008/10/05 12:59) looks clearly pre-regression, even on Txul, which some of the others didn't make a blip on.
http://hg.mozilla.org/mozilla-central/rev/6357eb31cec6 (2008/10/05 15:38) looks like post-regression based on the data so far. I'm going to back it out shortly.
Well, it looks like the backout didn't help, although I'll give it a little more time.
Was https://bugzilla.mozilla.org/show_bug.cgi?id=455990 out long enough to get a test run? I'm probably blind, but I don't see a talos test run.
Jim, it was out between "Thu Oct 02 07:03:28 2008 -0700" and "Sun Oct 05 20:45:11 2008 -0700". That would seem to be long enough, no?
So, the backout didn't help. Three theories:
(1) There was a series of small regressions during the window, and backing out one or two changes isn't going to fix the regression.
(2) The regression is because our code size or complexity crossed some threshold that makes certain things (perhaps PGO, perhaps something else) less efficient.
(3) There's a real code regression, but we're not able to find it because it's smaller than the variation caused by PGO's sensitive dependence on initial conditions (see comment 13).
I don't think it's worth spending more time on this bug.
Would it be useful to have a build system doing non-PGO builds and do perf testing on those as well? That might help in situations like this, as well as demonstrating what PGO is doing for us.
I agree with dbaron in comment 21, and think that we've blocked beta for as long as we can on this. I'll leave the bug open, though, in hopes that we can figure out what caused this. If you're on the cc list and an owner of one of the following bugs: > Bug 455913 > Bug 457050 > Bug 455311 > Bug 457047 > Bug 457728 > Bug 455990 > Bug 455500 Can you please take a good hard look through your patch and the affected areas to see if and how it might have affected our numbers here?
For what it's worth, I did that first thing when we had the pushlog... I didn't see anything in any of the patches involved. I'm also not sure that all the relevant people are cced, btw.
(In reply to comment #23) > If you're on the cc list and an owner of one of the following bugs: > > > Bug 455913 > > Bug 457050 > > Bug 455311 > > Bug 457047 > > Bug 457728 > > Bug 455990 > > Bug 455500 > > Can you please take a good hard look through your patch and the affected areas > to see if and how it might have affected our numbers here? Adding all remaining assignees except for bug 455311 where I don't have access.
I'm the assignee of bug 455311, so we're all good on that score.
I believe at this point we've relanded all the relevant changesets on m-c. Everyone, please double-check your patches are there!
Thoughts on continuing blocking on this? We need to get to the bottom of this somehow.
(In reply to comment #28) > Thoughts on continuing blocking on this? We need to get to the bottom of this > some how. I think everything in the range was backed out, and the regression stayed, so either a lot of small regressions added up, it's PGO being weird, or it's something else. At this point, I think this bug is likely INCOMPLETE.
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → INCOMPLETE