Closed Bug 1159384 Opened 7 years ago Closed 7 years ago

Capture available build time comparisons for build slaves in AWS versus Colo hardware

Categories

(Infrastructure & Operations :: RelOps: General, task)

Platform: x86_64 Windows Server 2008
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: q, Assigned: q)


Details

(Whiteboard: [windows])

Attachments

(1 file)

No description provided.
Depends on: 1124303
Numbers for total build times worked out during the recent Portland work week:

95th percentile end-to-end try build times on Windows 2008:

AWS region USE1 in play (slow network issues) - 7000 seconds
AWS region USW2 only - 5403 seconds
AWS region USW2, r3.2xlarge instance types with SSDs - 5100 seconds

IX hardware in SCL3 - 5611 seconds
Assignee: relops → q
Based on this first pass, it appears that AWS instance types with no known network issues are faster than our colo hardware. However, these jobs need a deeper breakdown by build type. As a proof of concept I believe the AWS 2008 builders are a success, though in troubleshooting the AWS regional network issues we lost our sterile environment. The numbers are encouraging, and a controlled staging environment should be built in both the colo and AWS USW2, focusing on instance types r3.xlarge and higher. Based on conversations with Releng management, there is no set standard for these metrics or prescribed method for retrieval.
I have been referred to Nick for some analysis training, and to https://secure.pub.build.mozilla.org/builddata/buildjson/ to start pulling data once the test environments are settled.
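
For anyone following along, pulling one of those buildjson dumps looks roughly like this. This is a minimal sketch only: the file name and the top-level "builds" key are assumptions about the dump layout, so check the buildjson index for the real file names and schema.

# Sketch only: fetch and decode one buildjson dump. The file name and
# the "builds" key are assumptions about the dump layout, not verified.
import gzip
import json
import urllib.request

BUILDJSON = ("https://secure.pub.build.mozilla.org/builddata/"
             "buildjson/builds-4hr.js.gz")

with urllib.request.urlopen(BUILDJSON) as resp:
    data = json.loads(gzip.decompress(resp.read()))

print("fetched %d build records" % len(data.get("builds", [])))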
Duplicate of this bug: 1149541
Let's re-instantiate the USW2 instances, put them in staging, and run direct comparison jobs there and in SCL3.

Can we please also test out using local storage (I'm not sure if we'll have enough) for builds and get numbers on that?
Flags: needinfo?(q)
Attached file trace1.txt
Q -> AWS
--------
4/16/2015

Traces captured during a slow upload.
Q: with the stack tuning that we've now done, what do our numbers look like?
Whiteboard: [windows]
Here are the current successful try build times in seconds, at the 95th percentile:

IX machines: 11648
EC2 r3.xlarge: 6594
EC2 r3.2xlarge: need more data (being collected now)

These numbers are based on a smaller sample size than is ideal, but they show a promising trend. The median shows a less dramatic difference. I have created a parse-and-calc Python script so we can keep tabs on these measurements as our sample size grows; a rough sketch of the approach follows below.

Data gathered from:
https://secure.pub.build.mozilla.org/buildapi/recent/
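
For the record, a rough sketch of the parse-and-calc approach. The field names ('buildername', 'starttime', 'endtime', 'result') and the ?format=json parameter are assumptions about the buildapi schema, not confirmed here; the actual script is linked later in this bug.

#!/usr/bin/env python3
# Sketch only: per-builder 95th percentile wall times from buildapi's
# recent-builds feed. Field names below are assumed, not verified.
import json
import math
import urllib.request
from collections import defaultdict

BUILDAPI_URL = ("https://secure.pub.build.mozilla.org/buildapi/recent/"
                "?format=json")

def percentile(values, pct):
    # Nearest-rank percentile of a non-empty list of numbers.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def p95_by_builder(records):
    # Group successful builds (buildbot result code 0) by builder name
    # and take the 95th percentile of endtime - starttime.
    durations = defaultdict(list)
    for rec in records:
        if rec.get("result") == 0 and rec.get("starttime") and rec.get("endtime"):
            durations[rec["buildername"]].append(rec["endtime"] - rec["starttime"])
    return {name: percentile(vals, 95) for name, vals in durations.items()}

if __name__ == "__main__":
    with urllib.request.urlopen(BUILDAPI_URL) as resp:
        records = json.load(resp)
    for name, p95 in sorted(p95_by_builder(records).items()):
        print("%-55s %6d" % (name, p95))

Nearest-rank is used here because it always returns an observed build time rather than an interpolated one.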
Flags: needinfo?(q)
What specific build type is this for?
All successful builds. I can narrow the scope during parsing. Do you have any suggestions for which type I should focus on?
By build type, IX (95th percentile seconds):

b2g-inbound-win32                                      4367
b2g-inbound-win32-debug                                5669
b2g-inbound-win32-mulet                                4934
b2g-inbound-win32-pgo                                 15769
b2g-inbound-win32_gecko                                6564
b2g-inbound-win32_gecko-debug                          6564
b2g-inbound-win64                                      6813
b2g-inbound-win64-debug                                6695
b2g-inbound-win64-pgo                                 15932
comm-aurora-win32-l10n-dep                              586
comm-beta-win32                                        5347
comm-beta-win32-debug                                  6514
comm-central-win32                                     6251
comm-central-win32-debug                               6327
comm-esr38-win32                                      10066
comm-esr38-win32-debug                                 9301
cypress-win64-debug                                    6922
fuzzer-win64-rev2                                      2042
fx-team-win32                                          6360
fx-team-win32-debug                                    5693
fx-team-win32-mulet                                    3144
fx-team-win32-pgo                                     15745
fx-team-win32_gecko-debug                              6799
fx-team-win64                                          5396
fx-team-win64-debug                                    6860
fx-team-win64-pgo                                     15750
fx-team_win32-debug_spidermonkey-compacting            8696
jamun_win32-debug_spidermonkey-compacting              8940
mozilla-aurora-win32                                  11861
mozilla-aurora-win32-l10n-dep                          1299
mozilla-aurora-win64                                  16934
mozilla-aurora-win64-l10n-dep                           485
mozilla-b2g32_v2_0-win32_gecko-nightly                11877
mozilla-b2g34_v2_1-win32_gecko-nightly                14482
mozilla-b2g37_v2_2-win32-mulet-nightly                 5868
mozilla-beta-win32                                    13712
mozilla-beta-win64                                    19285
mozilla-central-win32-mulet-nightly                    4954
mozilla-central-win32-pgo                             19254
mozilla-central-win32_gecko-nightly                    7276
mozilla-central-win64-pgo                             19260
mozilla-inbound-win32                                  6170
mozilla-inbound-win32-debug                            7552
mozilla-inbound-win32-mulet                            5265
mozilla-inbound-win32-pgo                             13499
mozilla-inbound-win32_gecko                            7127
mozilla-inbound-win32_gecko-debug                      8401
mozilla-inbound-win64                                  7302
mozilla-inbound-win64-debug                            7076
mozilla-inbound-win64-pgo                             15608
mozilla-inbound_win32-debug_spidermonkey-compacting    7718
mozilla-inbound_win32-debug_spidermonkey-plaindebug    4954
mozilla-inbound_win32_spidermonkey-plain               4482
try-comm-central-win32                                 5269
try-win32                                              4030
try-win32-debug                                        5602
try-win32-mulet                                        3939
try-win32_gecko                                        5619
try-win32_gecko-debug                                  7148
try-win64                                              4127
try-win64-debug                                        4809
try_win32-debug_spidermonkey-compacting                9028
try_win32_spidermonkey-compacting                      7302
try_win64-debug_spidermonkey-compacting                8717
By build type, EC2 r2.xlarge (95th percentile seconds):

try-comm-central-win32                                 5320
try-win32                                              6130
try-win32-debug                                        5606
try-win32-mulet                                        4170
try-win32_gecko                                        5849
try-win32_gecko-debug                                  6038
try-win64                                              5649
try-win64-debug                                        5970
try_win32-debug_spidermonkey-compacting               11369
try_win32-debug_spidermonkey-plaindebug                7491
try_win32_spidermonkey-compacting                      5957
try_win32_spidermonkey-plain                           3083
try_win64-debug_spidermonkey-compacting                6089
try_win64-debug_spidermonkey-plaindebug                7499
try_win64_spidermonkey-plain                           5277
Relevant tests compared (95th percentile seconds):

Build type                                   EC2 r3.xlarge      IX
try_win32_spidermonkey-compacting                     5957    7302
try_win32-debug_spidermonkey-compacting              11369    9028
try_win64-debug_spidermonkey-compacting               6089    8717
try-comm-central-win32                                5320    5269
try-win32                                             6130    4030
try-win32_gecko                                       5849    5619
try-win32_gecko-debug                                 6038    7148
try-win32-debug                                       5606    5602
try-win32-mulet                                       4170    3939
try-win64                                             5649    4127
try-win64-debug                                       5970    4809
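
A hypothetical helper for regenerating this side-by-side view from two per-pool percentile mappings, e.g. as produced by the sketch earlier in this bug; the function name and sample values below are illustrative, not part of the actual script.

# Hypothetical helper: given {builder: p95-seconds} mappings for two
# pools, print the builders common to both with the EC2-minus-IX delta.
def compare_pools(ec2, ix):
    for name in sorted(set(ec2) & set(ix)):
        print("%-45s ec2=%6d  ix=%6d  delta=%+6d"
              % (name, ec2[name], ix[name], ec2[name] - ix[name]))

# Sample values taken from the table above.
compare_pools(
    {"try-win32": 6130, "try-win64": 5649},
    {"try-win32": 4030, "try-win64": 4127},
)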
That is screaming for a spreadsheet (and I corrected your r2 to r3 in comment 12, since that seems to be what the instance type actually is):

https://docs.google.com/spreadsheets/d/1QQ2U13rmqo7OSTUmrFrwC_DvW31tScXX25yF_eRLIss

When you get the numbers for r3.2xlarge, please put them in there so we can do some easy comparisons.
Some times were added for r3.2xlarge, but we probably need more data. We're also going to investigate r3.4xlarge and the c3 types again, since our problems with them previously were network constraints that may have been mitigated by the network patches Q deployed.
Using the following script to slice and dice data:

https://github.com/mozilla/buildapi-recent-stats-read
Flags: needinfo?(q)
C-type instance performance looks good so far. There was an error with subnets and signing, so I am having to redeploy my c3.xlarge instances.
No redeploy yet, as focus on Puppet configs has taken precedence. I will get the c4.2xlarge numbers uploaded; however, with average build times going up across the board, this data will be skewed.
Tables updated in the original spreadsheet. I will be stopping instances today to save cost.
Numbers were captured during a period of comparable performance for all types. I think we now have a good rough picture of speeds, and this can be reevaluated in a repeatable way. Closing this bug; future comparisons should live in their own bugs.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED