Closed Bug 400045 Opened 12 years ago Closed 12 years ago

Tp regression on bldlnx03 and talos qm-pxp01

Categories

(Core :: General, defect)


Tracking


RESOLVED WONTFIX

People

(Reporter: johnath, Assigned: johnath)


Details

Alongside the Ts regression in bug 399955, there was a Tp regression reported on both the linux perf testing box and winxp talos.

bldlnx03: http://build-graphs.mozilla.org/graph/query.cgi?tbox=bl-bldlnx03_fx-linux-tbox-head&testname=pageload&autoscale=1&size=&units=ms&ltype=&points=&showpoint=2007%3A10%3A15%3A20%3A29%3A49%2C267&avg=1&days=3

Talos: http://graphs.mozilla.org/graph.html#spst=range&spstart=1192320000&spend=1192552685&bpst=cursor&bpsc=1192551525.0548356&bpstart=1192320000&bpend=1192552685&m1tid=17&m1bl=0&m1avg=1&m2tid=7&m2bl=0&m2avg=1

The tree was closed, and all the following patches were backed out (except timeless' js/ checkin, which is not included in the build):

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=PhoenixTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&cvsroot=%2Fcvsroot&mindate=1192464522&maxdate=1192479390

This, combined with backing out stuart's checkin for bug 296818, seems to have restored linux Tp, but talos is still reporting high numbers.  Since the window is empty, we're rebooting the box to see if that fixes things, since earlier bustage might have left traces of breakage.  We recognize the inherent superstition in this.
OS: Mac OS X → All
Hardware: PC → All
The talos boxes seem a bit hosed.  I cannot get any graphs from them.  Every time I try to get a perf graph it hangs and I eventually get a "script running too long" pop-up.
(In reply to comment #1)
> The talos boxes seem a bit hosed.  I cannot get any graphs from them.  Every
> time i try to get a perf graph it hangs and I eventually get a script running
> too long pop-up.
> 

Try changing 'All data' to 'previous 5 days'.
(In reply to comment #2)
> (In reply to comment #1)
> > The talos boxes seem a bit hosed.  I cannot get any graphs from them.  Every
> > time i try to get a perf graph it hangs and I eventually get a script running
> > too long pop-up.
> > 
> 
> Try changing 'All data' to 'previous 5 days'.

Ah, you are correct.  This probably needs to be fixed so that the links on the tinderbox pages provide proper values for these choices: 5 days only, this machine only, etc.
The other links are valid. If you have a slow connection/slow computer/random other factor the full links will sometimes not work. I don't think there's a way to link to a "past 5 days graph" right now, unfortunately.
(In reply to comment #4)
> The other links are valid. If you have a slow connection/slow computer/random
> other factor the full links will sometimes not work. I don't think there's a
> way to link to a "past 5 days graph" right now, unfortunately.
> 

When I click on the ts_1 link on a talos build in the Tinderbox waterfall, I expect to get a graph of 5 days only for the trunk/branch of this build on this machine only, and only ts data.

What I get instead is a graph of all the data from all the talos hosts on 1.8 and 1.9.  It does get the part of only graphing "ts" correct, so I guess that is something.

I would file a bug on this if I had a clue which component to file it on.
We've spent most of the day with the tree closed, trying to be intelligent about the backouts, and even rebooting talos (with the multi-hour data lag that implies) and the talos regression persists, so we're taking the "back everything out down to the regression point" approach defined in the Perf Regression policy doc ( http://www.mozilla.org/hacking/regression-policy.html#Implementation ).
Why are bits that aren't even built (or run) in Firefox backed out? That doesn't make any sense...
Gijs - the only reason to back everything out to the point of regression is to establish whether a regression is actually real.  As I mentioned in comment 6, we backed out anything we could intelligently back out first.  Then we backed out the rest of the window, per the perf regression policy.  I agree that it sucks, but we wanted a clean comparison.

Overnight, the talos regression still didn't go away, even though comparisons of the current codebase to the pre-regression codebase at 09:00AM PDT (Oct 15) showed no meaningful difference.  And yet the regression persists, across more than one talos box, across reboots.

This morning, Robcee, Mossop and I took a look at the per-site deltas, which skew heavily to RTL languages.  Masayuki checked in a small patch hours before the regression which touched RTL layout.  There were multiple normal-looking talos runs after that check-in, so no one included it in their window when looking at the problem, but our theory is that those talos runs might have been deceptively low - inconveniently timed noise-based lows - and that it was actually the source of the Talos regression.  This was covered up by the fact that later checkins on the 15th introduced regressions to linux-Tp and box weirdness broke Ts (bug 399955).

The best plan I can see now is to back Masayuki's patch out and see if talos recovers.  This is complicated by a scheduled maintenance window on the testing boxes from 6AM-9AM PDT today, because we don't want to be backing out without unit test boxes.
From offline discussion in today's perf meeting, reassigning to Core::General, johnath as he's been chasing down this issue...
Assignee: build → nobody
Component: Build & Release → General
Product: mozilla.org → Core
QA Contact: mozpreed → general
Version: other → unspecified
Assignee: nobody → johnath
Bug 399955 could only affect pages that use text-align:justify a lot, and those not much. It's not actually RTL related.
(In reply to comment #9)
[...]
> This is complicated by a scheduled maintenance window on testing boxes from
> 6AM-9AM PDT today, because we don't want to be backing out without unit test
> boxes.

If this is the case then maybe we can talk to IT about rescheduling the maintenance window to avoid further tree closure due to this odd regression.
Though I guess it's a bit late for that now, duh.
Can we run a build well before the regression interval through the testing boxes to verify the numbers?
As an update for those eagerly awaiting resolution here:

 - backing Masayuki's patch out produced some initially low numbers, but most talos runs on both pxp01 and pxp03 show times in the regression band.
 - After this failed, we tried schrep's suggestion, and ran two baseline runs on pxp03 using builds from Saturday (13th) and Sunday (14th) to see what the machine would say.  Results here show both times in the low 900s, consistent with pre-regression Tp: http://graphs.mozilla.org/#spst=range&spstart=1192273200&spend=1192363200&bpst=cursor&bpstart=1192273200&bpend=1192363200&m1tid=33984&m1bl=0&m1avg=0
 - pxp03 was then moved back to trunk, where it has moved back to ~920 (post-regression times).

Stuart did another code-compare of trunk against a pre-regression version, diff found here: http://people.mozilla.com/~pavlov/changes.diff

He followed up with the following analysis, summarizing a lot of our thoughts, I think:

> The textframe changes were apparently backed out but were earlier than 9am, which I've verified by diffing against an earlier date.
> Assuming that the NSS changes in my diff are due to cvs being stupid and diffing the trunk rather than the tag/branch, there are no changes left.
> I've got nothing.

David Baron did a binary diff to try to spot changes that might be missed in source code.  The results are here: http://pastebin.mozilla.org/221119.  He concludes with:

> Relatively few differences in components/, but the NSS and NSPR library diffs
> scare me a little.  But there are also some known source diffs between these builds.

Dave Townsend analyzed 10 per-page data sets from before and after the regression, and found post-regression pages more than two standard deviations above their prior runs.  That list is here: http://pastebin.mozilla.org/220887
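For reference, the kind of check described above can be sketched as follows. This is a minimal, hypothetical reconstruction: the function name and the data shape (page name mapped to a list of per-run load times in ms) are assumptions, since the actual talos per-page format isn't shown in this bug.

```python
import statistics

def flag_regressions(baseline_runs, current_runs, threshold_sd=2.0):
    """Flag pages whose post-regression mean load time exceeds the
    pre-regression mean by more than `threshold_sd` baseline standard
    deviations.  Both arguments map page name -> list of load times (ms).
    Returns {page: (baseline_mean, current_mean)} for flagged pages."""
    flagged = {}
    for page, times in baseline_runs.items():
        if page not in current_runs or len(times) < 2:
            continue  # need a counterpart and at least 2 samples for stdev
        mean = statistics.mean(times)
        sd = statistics.stdev(times)
        cur = statistics.mean(current_runs[page])
        if sd > 0 and cur > mean + threshold_sd * sd:
            flagged[page] = (mean, cur)
    return flagged
```

On noisy talos data a per-page cutoff like this mostly filters out ordinary run-to-run variance while surfacing the handful of pages carrying the regression.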

Rob Sayre has pointed out that many of the top-regressing sites showed up in bug 396064 as well (see attachment 280855 [details]).  In that case, a reboot fixed the results.  Sayre also speculated that, since these tend to be long pages, it might be a window size or scrollbar issue on the test box, rather than a code change.  I got Ben Hearsum to check this, earlier today:

09:10 < johnath> did you see sayre's speculation in here yesterday?  That there's a window size/scrollbar thing 
                 happening, because  those pages tend to be long ones?
09:10 < johnath> unlike the various regional googles, for instance, none of which show up in the list
09:10 < johnath> is that something you can weigh in on?  That maybe talos is using weird window size or resolution or is otherwise 
                 not-usual in a way that would impact how much of a page is visible?
09:11 < bhearsum> the window size always stays the same, actually
09:12 < bhearsum> so, i would say that's a non-issue
09:13 < bhearsum> pxp01 and pxp03 both have a resolution of 1280x1024, the firefox window is ~1024x768 (whatever it is, it's the 
                  same on every run, for the whole run)
For reference, the two baseline results on pxp03 are from the nightlies 2007101304 and 2007101405.

The checkins since the nightly of the 14th are:
http://bonsai.mozilla.org/cvsquery.cgi?module=PhoenixTinderbox&date=explicit&mindate=1192359540&maxdate=1192464522
The build starting 2007-10-18 09:19 PDT is a clobber, after the VM was rebooted.
Note: dbaron's binary diff reveals that sqlite3.dll changed.

From http://pastebin.mozilla.org/221119:

Binary files 2007-10-15-04-trunk/firefox/sqlite3.dll and
2007-10-17-23-trunk-tbox/firefox/sqlite3.dll differ

Note that there were no changes to mozilla/db/sqlite during this window.

I'm not sure why that .dll changes each time I build sqlite, but it does.

(I'm going to log a spin-off bug on that, as this will be increasing the size
of our partial updates.)
There was some discussion in #developers suggesting that the binaries contain a unique number and so always differ from build to build:

<luser> oh the binaries always have a different pdb signature
<luser> on windows
<luser> anytime you do a clobber the compiler generates a new UUID for each PDB file
<luser> then it increments an age field in that signature every time you rebuild
Adding a link here because it isn't here already - on Tuesday night several of us (notably KaiE and Wolf) discussed putting together a list of patches backed out, organized by blocker/approval status, for eventual re-landing.

http://wiki.mozilla.org/Tp_regression_relanding_20071016
Update: The nightly tinderbox was rebooted. Both Talos Tp numbers stayed within the regression band.
One possibility is that a series of dep builds leading up to the regression window left the build in an anomalous state where it was actually faster than a clobber build. Then during the regression window we lost that anomaly.

Okay, it's an unlikely scenario, but at this point all scenarios are unlikely.

Personally I think we should reopen the tree and carry on, while saving away the two builds and have someone carry on doing forensic analysis in the hope of figuring out what happened. Because the scary thing is, as far as we know it could happen again.
(In reply to comment #15)
>  - After this failed, we tried schrep's suggestion, and ran two baseline runs
> on pxp03 using drivers from saturday (13th) and sunday (14th) to see what the
> machine would say.  Results here show both times in the low 900s, consistent
> with pre-regression Tp:

Also, however, consistent with the anomalous runs right before qm-pxp03 was taken off line to do the baseline runs:

http://graphs.mozilla.org/#spst=range&spstart=1192147200&spend=1192739095&bpst=cursor&bpstart=1192147200&bpend=1192739095&m1tid=33984&m1bl=0&m1avg=0&m2tid=17&m2bl=0&m2avg=0&m3tid=7&m3bl=0&m3avg=0
If you look at the dip on the afternoon of the 17th -- it's pretty much even with the 2 baseline runs.

> Stuart did another code-compare of trunk against a pre-regression version, diff
> found here: http://people.mozilla.com/~pavlov/changes.diff

I did a binary diff of the hourlies from http://hourly-archive.localgho.st/win32.html -- in particular,
20071015_0514_firefox-3.0a9pre.en-US.win32.zip and
20071018_0001_firefox-3.0a9pre.en-US.win32.zip, and I haven't found
any differences of interest.  diff says that 4 jar files differ, two
text files with the build dates in them differ, and a bunch of
dlls/exes differ (plus the two NSS .chk files).

I unzipped the 4 jar files and there are no differences between
their unzipped contents.
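A content-level comparison like the one described above can be sketched as a small helper. This is a hypothetical illustration, not the actual tool used; it deliberately ignores archive metadata (timestamps, compression settings), which by itself can make two .jar files differ byte-for-byte even when every member is identical.

```python
import zipfile

def zip_contents_differ(path_a, path_b):
    """Return the sorted list of member names whose uncompressed bytes
    differ between two zip/jar archives, plus any names present in only
    one of them.  An empty list means the contents are identical."""
    with zipfile.ZipFile(path_a) as za, zipfile.ZipFile(path_b) as zb:
        names_a, names_b = set(za.namelist()), set(zb.namelist())
        diffs = sorted(names_a ^ names_b)  # members in only one archive
        for name in sorted(names_a & names_b):
            if za.read(name) != zb.read(name):
                diffs.append(name)
        return diffs
```

An empty result here, as dbaron reports for the 4 jars, rules the archives out as the source of a behavioral difference.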

I did an objdump -sCd (on Linux) of all the exes and dlls that
differ, and all the differences are in the .rdata section, and seem
pretty minor -- consistent with the linker producing different
binaries, although I should perhaps look a little more closely.


I also did a pull recently on a tree that hadn't pulled since Saturday, and there were no updates in NSPR or NSS, so I'm pretty confident there wasn't any tag pushing (which I think is now against policy, but they used to do it a lot).
(In reply to comment #23)
> I also did a pull recently on a tree that hadn't pulled since saturday, and
> there were no updates in NSPR or NSS, so I'm pretty confident there wasn't any
> tag pushing (which I think is now against policy, but they used to do it a
> lot).

Yes, the contents of tags are no longer moving.
For updating NSPR/NSS we produce new tags and change mozilla/client.mk.
From Stuart:

OK, I talked to schrep and decided that we should reopen the tree for
metered checkins of beta blockers and then continue to meter checkins
until either a) we're happy with the perf numbers or b) we don't
think there will be a thousand checkins at once.

We'll try to get new minis up running at half speed that we can use
to replace the existing talos blade boxes with asap.
(In reply to comment #25)
> From Stuart:
> 
> OK, I talked to schrep and decided that we should reopen the tree for
> metered checkins of beta blockers and then continue to meter checkins
> until either a) we're happy with the perf numbers or b) we don't
> think there will be a thousand checkins at once.
> 
> We'll try to get new minis up running at half speed that we can use
> to replace the existing talos blade boxes with asap.

Given that the only mini that I can see that is giving even half-way usable results also seems to show a Tp regression, how does replacing the two talos blade boxes we have help us?
This bug has been sitting idle for a while and in last week's perf meeting, the story was that someone (stuart?) thought maybe the boxes had fixed themselves.

Is this still under investigation, and/or does anyone have reason to expect that an answer will be forthcoming?  Roc had mentioned taking some before and after builds offline to continue cross-testing but I don't know the state of that work, if any.
(In reply to comment #27)
> This bug has been sitting idle for a while and in last week's perf meeting, the
> story was that someone (stuart?) thought maybe the boxes had fixed themselves.
> 
> Is this still under investigation, and/or does anyone have reason to expect
> that an answer will be forthcoming?  Roc had mentioned taking some before and
> after builds offline to continue cross-testing but I don't know the state of
> that work, if any.

Having given my last question a month, I'm resolving this bug WONTFIX since we have no definitive action to take here.

Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX