Closed Bug 697092 Opened 13 years ago Closed 12 years ago

tp5 for Win7 became bi-modal after Oct. 3rd

Categories

(Testing :: Talos, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: armenzg, Unassigned)

Details

(Whiteboard: [SfN])

I posted this on dev.tree-management
#############
Hi,
Does anyone have any theories on what landed on Oct. 3rd that made Tp5 so
unstable on Win7?

There have been a lot of moving pieces in the last few weeks (switching
to mozAfterPaint, losing coverage on the PGO/Non-PGO switchover), which
is why I am bringing this to you so you can help me figure it out.

tp5 used to be very constant, but it no longer is for Win7. WinXP seems
to still be quite constant.

http://graphs-new.mozilla.org/graph.html#tests=[[115,1,1],[115,1,12],[115,94,1],[115,94,12],[89,1,12],[89,1,1]]&sel=none&displayrange=90&datatype=running

Notice in the 90 days view that tp5 was quite constant (the old version) 
and that the MozAfterPaint was quite constant from the 21st to Oct. 3rd.

I looked at Releng's maintenance page and I didn't see anything that 
changed in the infra:
https://wiki.mozilla.org/ReleaseEngineering:Maintenance

NOTE: From Oct. 5th to Oct. 12th both PGO and Non-PGO builds were 
reporting to the same talos branch (We fixed it on the 12th) and we 
lacked posts on the PGO branches.

NOTE: On Sept. 21st we added MozAfterPaint, which replaced the old tp5 run.

cheers,
Armen

DATES:
* mozAfterPaint is added on Sept. 21st to mozilla-central only: 
https://bug661918.bugzilla.mozilla.org/attachment.cgi?id=559510
* we enabled the new tp5 mozAfterPaint in all branches _and_ disabled it 
in all branches on Oct. 5th with:
http://hg.mozilla.org/build/buildbot-configs/rev/0b5f2fc8b6b0
I triggered a changeset from before Oct. 3rd:
tbpl.mozilla.org/?tree=Try&rev=e748ccc61c04

It should report in here for whoever ends up looking into this.

We should also look into triggering several win7 tp jobs for a given build and see if it varies.
Try run for e748ccc61c04 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=e748ccc61c04
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-e748ccc61c04
Try run for 69f7bacbbd94 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=69f7bacbbd94
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-69f7bacbbd94
changeset from Sep. 30th (f25928e4847d changeset): [1]
    tp5: 329.6 (details)
    tp5_shutdown: 1538.0 (details)

changeset from Sep. 30th (parent f25928e4847d - same as above):
    tp5_paint: 424.45 (details)
    tp5_shutdown_paint: 1294.0 (details)

changeset from today (parent 0b03882d8edf):
    tp5_paint: 535.96 (details)
    tp5_shutdown_paint: 1322.0 (details)

http://graphs-new.mozilla.org/graph.html#tests=[[115,1,12],[89,1,12]]&sel=1317022415157.2234,1319569434680&displayrange=90&datatype=running

I have re-triggered again each job.

f25928e4847d on graphs-new shows a score of 357 when it ran on Sept. 30th, compared to 424.45 today (even though it is the same changeset).
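
To put a number on that difference, here is a minimal sketch of the relative change between the two scores quoted above. The helper name `percent_change` is only illustrative and is not part of Talos or graphs-new.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old

# Scores quoted above for f25928e4847d: Sept. 30th run vs. today's re-run.
old_score = 357.0
new_score = 424.45

print(f"change: {percent_change(old_score, new_score):+.1f}%")
```

A positive value means the tp5 number went up, i.e. the page load got slower.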

compare-talos also suggests that something is funny. Unfortunately, I am not sure whether the tool is up to date with respect to which suites to compare:
http://perf.snarkfest.net/compare-talos/index.html?oldRevs=f25928e4847d&newRev=e748ccc61c04&tests=tp5_paint,tp5_memset_paint,tp5_pbytes_paint,tp5_shutdown_paint&submit=true

Perhaps we should find a nightly from before Oct. 3rd (since nightly builds don't get deleted) and trigger a few win7 tp jobs to see if the numbers have changed. I am not sure whether triggering a new build of f25928e4847d through the try server could add unknown factors.

[1] https://tbpl.mozilla.org/?rev=f25928e4847d&jobname=tp
Assignee: nobody → armenzg
Priority: -- → P2
Changes that did not cause this:
################################
* reconfig on Oct. 5th: http://hg.mozilla.org/build/buildbot-configs/rev/403543dba072
* reconfig on Sept. 30th: http://hg.mozilla.org/build/buildbot-configs/rev/97f6d8506ccd
[armenzg@dm-wwwbuild01 zips]$ ls -lrt old/ | grep Oct
-rw-r--r-- 1 armenzg  build  6012439 Oct 14 09:11 talos.bug694579.zip
-rw-rw-r-- 1 jford    build  6011623 Oct 24 11:03 talos.bug696810.zip
* bug 688346 - only talos.zip deployed around that time
-rw-rw-r-- 1 asasaki  build  6002801 Sep 21 17:51 talos.zip

This *IS* a code issue because:
* Mozilla-Beta is unaffected (all other branches are)
* Mozilla-Aurora started getting affected after Nov. 8th (Firefox 8 merge day)

If I look only at "tp5" and load "Firefox" and "Mozilla-Inbound" [2] I can see that the problem was introduced from m-i into m-c.

The first hiccup was at 00:00 with a 14.8% increase:
http://hg.mozilla.org/integration/mozilla-inbound/rev/6127f8fddb96
This could just be interim.

The first *set* of several runs going up and down started at 17:00:
http://hg.mozilla.org/integration/mozilla-inbound/rev/265d39da5c3d

e) Mon Oct 03 07:26:49 2011 -0700 265d39da5c3d 447 Alexander Surkov — Bug 664142
d) Mon Oct 03 05:42:35 2011 -0700 704f37801611 --- Benoit Jacob — Bug 522193
c) Mon Oct 03 03:07:07 2011 -0700 be9874f75bae 332 Alexander Surkov — Bug 684818 
b) Mon Oct 03 00:56:44 2011 -0700 696394093f34 --- Masayuki Nakano — Bug 690700
a) Sun Oct 02 21:04:23 2011 -0700 f78254d32632 339 Matt Woodrow — Bug 691106

NOTE: The column after the changeset ID shows the tp5 score; "---" means there was no tp5 run for that changeset.

I can see that before E (the beginning of the instability) we don't have coverage for D and B. I will push these 5 changesets to try and run several tp5 runs for each one.
Unfortunately this is an instability issue and not something that can easily be tracked.

[1]
http://graphs-new.mozilla.org/graph.html#tests=[[115,1,12],[89,1,12],[115,53,12],[115,52,12]]&sel=1314374991060,1322150991060&displayrange=90&datatype=running

[2]
http://graphs-new.mozilla.org/graph.html#tests=[[89,63,12],[89,1,12]]&sel=1317485391060,1317948642123&displayrange=90&datatype=running

[3]
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2011-10-03&enddate=2011-10-04

[4]
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2011-10-03&enddate=2011-10-04
Try run for b6db134c1ad0 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=b6db134c1ad0
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-b6db134c1ad0
Try run for e8cc27139a98 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=e8cc27139a98
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-e8cc27139a98
Try run for 9dcebd888842 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=9dcebd888842
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-9dcebd888842
Try run for 7eadd1d998ff is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=7eadd1d998ff
Results (out of 5 total builds):
    success: 3
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-7eadd1d998ff
Try run for 23ee7a546c10 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=23ee7a546c10
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-23ee7a546c10
Try run for 3e5dfdc4a201 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=3e5dfdc4a201
Results (out of 3 total builds):
    success: 3
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/armenzg@mozilla.com-3e5dfdc4a201
We know that the problem was brought in from m-i at 06:36:47 -0700 with:
hg.mozilla.org/mozilla-central/rev/a896a9e237a0 (77992)
* Merge last good changeset from mozilla-inbound to mozilla-central 
which is 696394093f34 on m-i (Mon Oct 03 00:56:44 2011 -0700):
http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=25b8388347af

The score of a896a9e237a0 is 378 (+5.3%) and 402 (+6.4%).

ASSUMPTION WARNING: I assume that these two higher-than-median runs are good enough to indicate that the problem appears on mozilla-central.

This discards C, D, and E. Why?
Because those changesets did not make it into mozilla-central until Mon Oct 03 16:22:50 2011 -0700:
http://hg.mozilla.org/mozilla-central/rev/25b8388347af 
which is 10 hours later.

Let's then ignore C, D and E and focus on A & B.
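
The timing argument can be checked with a short sketch. It assumes the 06:36:47 -0700 merge (a896a9e237a0) also happened on Mon Oct 03 2011, which is implied but not stated above; the variable and function names are only illustrative.

```python
from datetime import datetime

# Same timestamp layout as the pushlog lines quoted in this bug.
FMT = "%a %b %d %H:%M:%S %Y %z"

def parse(ts):
    """Parse a pushlog-style timestamp into a timezone-aware datetime."""
    return datetime.strptime(ts, FMT)

# Assumption: the first m-c merge showing higher scores was on Oct 03.
problem_first_seen = parse("Mon Oct 03 06:36:47 2011 -0700")  # a896a9e237a0
# Merge that brought C, D and E into mozilla-central.
cde_landed = parse("Mon Oct 03 16:22:50 2011 -0700")          # 25b8388347af

gap = cde_landed - problem_first_seen
print("C/D/E landed after the problem was first seen:", cde_landed > problem_first_seen)
print("gap:", gap)
```

If the later merge really landed after the first bad run, C, D and E cannot explain that run, which is the reasoning used to drop them.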

If I look at the runs of A I see that it has variance. This means that it could be that the problem was introduced earlier than that but not seen in graphs-new, or that it is caused by one of these extra variables:
* the tp5 run is with --mozAfterPaint (we did not have --mozAfterPaint for m-i at that time)
* the try build was without PGO, which makes the runs difficult to compare

I am going to do the following:
* setup a test master to run tp5 without --mozAfterPaint and attach a couple of production win7 testing machines
* run 696394093f34 (suspect) and f78254d32632 (hoped to be known good) with PGO enabled, plus changeset ceb9e5cad736, which is further in the past (Sun Oct 02 04:54:02 2011 -0700)

############################################

The results for the record:
e) 265d39da5c3d -> https://tbpl.mozilla.org/?tree=Try&rev=e8cc27139a98
d) 704f37801611 -> https://tbpl.mozilla.org/?tree=Try&rev=9dcebd888842
c) be9874f75bae -> https://tbpl.mozilla.org/?tree=Try&rev=23ee7a546c10
b') 696394093f34 -> https://tbpl.mozilla.org/?tree=Try&rev=7eadd1d998ff

b) 696394093f34 -> https://tbpl.mozilla.org/?tree=Try&rev=b6db134c1ad0
    tp5_paint: 425.66 (details)
    tp5_paint: 472.37 (details)
    tp5_paint: 471.4 (details)
    tp5_paint: 433.14 (details)
    tp5_paint: 428.85 (details)
    tp5_paint: 485.77 (details)

a) f78254d32632 -> https://tbpl.mozilla.org/?tree=Try&rev=3e5dfdc4a201
tp5_paint: 530.97 (details)
tp5_paint: 423.38 (details)
tp5_paint: 481.75 (details)
tp5_paint: 435.95 (details)
tp5_paint: 432.93 (details)
tp5_paint: 539.61 (details)
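
For comparing the two sets of runs, a minimal sketch that summarizes the spread of the tp5_paint scores listed above. The scores are copied from this comment; the `summarize` helper is only illustrative and is not part of Talos or compare-talos.

```python
from statistics import mean, median, stdev

# tp5_paint scores copied from the try runs listed above.
runs = {
    "b) 696394093f34": [425.66, 472.37, 471.4, 433.14, 428.85, 485.77],
    "a) f78254d32632": [530.97, 423.38, 481.75, 435.95, 432.93, 539.61],
}

def summarize(scores):
    """Return basic spread statistics for a list of scores."""
    low, high = min(scores), max(scores)
    return {
        "min": low,
        "max": high,
        "median": median(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),
        # Spread between best and worst run, relative to the best run.
        "range_pct": 100.0 * (high - low) / low,
    }

for label, scores in runs.items():
    s = summarize(scores)
    print(f"{label}: min={s['min']:.2f} max={s['max']:.2f} "
          f"median={s['median']:.2f} stdev={s['stdev']:.2f} "
          f"range={s['range_pct']:.1f}%")
```

With only six runs per changeset these figures are rough, but they give a quick way to see whether the supposedly good changeset is any steadier than the suspect one.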
> * run 696394093f34 (suspect)

This changeset did not change any files used on Win 7.
OS: All → Windows 7
Priority: P2 → P3
This was determined not to be a releng change.
I doubt I will be able to investigate any further and I would hope someone else would like to pick this up.

Feel free to pass to another component or close the bug.

As far as I know we can still detect regressions, but a bi-modal graph makes them visually hard to spot.
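
One rough way to look at bi-modal data without relying on the graph is to split the scores at the largest gap between sorted values and treat each side as a mode. This is only a sketch of that idea, not something Talos or graphs-new does; the sample scores are the tp5_paint runs for 696394093f34 from an earlier comment.

```python
def split_at_largest_gap(scores):
    """Split scores into (low, high, gap) at the largest gap between sorted values."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        raise ValueError("need at least two scores to split")
    # (gap size, index of the value just below the gap) for each neighbouring pair.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    gap, idx = max(gaps)
    return ordered[: idx + 1], ordered[idx + 1 :], gap

# tp5_paint runs for 696394093f34, copied from the earlier comment.
scores = [425.66, 472.37, 471.4, 433.14, 428.85, 485.77]

low, high, gap = split_at_largest_gap(scores)
print("low mode: ", low)
print("high mode:", high)
print(f"gap between modes: {gap:.2f}")
```

Tracking the low and high groups separately over time would make a shift in either mode easier to see than in the combined graph. The split is naive: it always produces two groups, even when the data has only one mode.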
Assignee: armenzg → nobody
Component: Release Engineering → Talos
Priority: P3 → --
Product: mozilla.org → Testing
QA Contact: release → talos
Summary: tp5 for Win7 became unstable after Oct. 3rd → tp5 for Win7 became bi-modal after Oct. 3rd
Version: other → Trunk
Whiteboard: [SfN]
That was WinXP but nevertheless I believe we're now good.

What do you think?
http://graphs.mozilla.org/graph.html#tests=[[206,131,1],[206,131,12]]&sel=none&displayrange=90&datatype=running
It looks good now; we had a mess for a few weeks.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX