Closed Bug 1159384 Opened 10 years ago Closed 9 years ago

Capture available build time comparisons for build slaves in AWS versus Colo hardware

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86_64
OS: Windows Server 2008
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: q, Assigned: q)

References

()

Details

(Whiteboard: [windows])

Attachments

(1 file)

No description provided.
Depends on: 1124303
Numbers for total build times worked out during the recent Portland work week. Ninety-fifth percentile of calculated end-to-end try build times on Windows 2008:

AWS region USE1 in play (slow network issues): 7000 seconds
AWS region USW2 only: 5403 seconds
AWS region USW2, r3.2xlarge instance types with SSDs: 5100 seconds
IX hardware in scl3: 5611 seconds
Assignee: relops → q
So based on the first pass, it appears that instance types in AWS with no known network issues are faster than our hardware in the colo. However, these jobs need a deeper breakdown by type. As a proof of concept I believe the AWS 2008 builders are a success; however, in troubleshooting the AWS regional network issues we lost our sterile environment. I believe the numbers are encouraging and that a controlled staging environment should be built in both the colo and AWS USW2, focusing on instance types r3.xlarge and higher. Based on conversations with Releng management, there is no set standard for these metrics or prescribed method for retrieval. I have been referred to Nick for some analysis training and to https://secure.pub.build.mozilla.org/builddata/buildjson/ to start pulling data after the test environments are settled.
Let's re-instantiate the USW2 instances, put them in staging, and run direct comparison jobs there and in SCL3. Can we please also test out using local storage (I'm not sure if we'll have enough) for builds and get numbers on that?
Flags: needinfo?(q)
Attached file trace1.txt
Q -> AWS -------- 4/16/2015 Traces captured during a slow upload
Q: with the stack tuning that we've now done, what do our numbers look like?
Whiteboard: [windows]
Here are the current successful try build times in seconds, as a 95th-percentile measurement:

IX machines: 11648
EC2 r3.xlarge: 6594
EC2 r3.2xlarge: need more data (being collected now)

These numbers are based on a smaller sample size than is ideal but show a promising trend. The median shows a less dramatic difference. I have created a parse-and-calc Python script so we can keep tabs on these measurements as our sample size grows. Data gathered from: https://secure.pub.build.mozilla.org/buildapi/recent/
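For reference, a minimal sketch of how such a parse-and-calc pass could look. This is not the actual script; the query string and the JSON field names (buildername, starttime, endtime, result) are assumptions about the buildapi output and may not match the real schema.

# Minimal sketch of a parse-and-calc pass over buildapi "recent" data.
# The query string and the field names (buildername, starttime, endtime,
# result) are assumptions and may differ from the real buildapi schema.
import json
import math
import urllib.request
from collections import defaultdict

BUILDAPI_URL = "https://secure.pub.build.mozilla.org/buildapi/recent/?format=json"

def percentile(values, pct):
    # Nearest-rank percentile, e.g. pct=95 for the 95th percentile.
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100.0 * len(ordered)) - 1)
    return ordered[rank]

def build_time_stats(url=BUILDAPI_URL, pct=95):
    with urllib.request.urlopen(url) as resp:
        builds = json.load(resp)
    durations = defaultdict(list)
    for b in builds:
        # result == 0 is assumed to mean "success"; skip unfinished jobs.
        if b.get("result") == 0 and b.get("starttime") and b.get("endtime"):
            durations[b["buildername"]].append(b["endtime"] - b["starttime"])
    # 95th-percentile end-to-end time per builder name.
    return {name: percentile(times, pct) for name, times in durations.items()}

if __name__ == "__main__":
    for name, p95 in sorted(build_time_stats().items()):
        print("%s\t%d" % (name, p95))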
Flags: needinfo?(q)
What specific build type is this for?
All successful builds. I can narrow the scope during parsing. Do you have any suggestions for which type I should focus on?
By build type, IX (seconds):

b2g-inbound-win32 4367
b2g-inbound-win32-debug 5669
b2g-inbound-win32-mulet 4934
b2g-inbound-win32-pgo 15769
b2g-inbound-win32_gecko 6564
b2g-inbound-win32_gecko-debug 6564
b2g-inbound-win64 6813
b2g-inbound-win64-debug 6695
b2g-inbound-win64-pgo 15932
comm-aurora-win32-l10n-dep 586
comm-beta-win32 5347
comm-beta-win32-debug 6514
comm-central-win32 6251
comm-central-win32-debug 6327
comm-esr38-win32 10066
comm-esr38-win32-debug 9301
cypress-win64-debug 6922
fuzzer-win64-rev2 2042
fx-team-win32 6360
fx-team-win32-debug 5693
fx-team-win32-mulet 3144
fx-team-win32-pgo 15745
fx-team-win32_gecko-debug 6799
fx-team-win64 5396
fx-team-win64-debug 6860
fx-team-win64-pgo 15750
fx-team_win32-debug_spidermonkey-compacting 8696
jamun_win32-debug_spidermonkey-compacting 8940
mozilla-aurora-win32 11861
mozilla-aurora-win32-l10n-dep 1299
mozilla-aurora-win64 16934
mozilla-aurora-win64-l10n-dep 485
mozilla-b2g32_v2_0-win32_gecko-nightly 11877
mozilla-b2g34_v2_1-win32_gecko-nightly 14482
mozilla-b2g37_v2_2-win32-mulet-nightly 5868
mozilla-beta-win32 13712
mozilla-beta-win64 19285
mozilla-central-win32-mulet-nightly 4954
mozilla-central-win32-pgo 19254
mozilla-central-win32_gecko-nightly 7276
mozilla-central-win64-pgo 19260
mozilla-inbound-win32 6170
mozilla-inbound-win32-debug 7552
mozilla-inbound-win32-mulet 5265
mozilla-inbound-win32-pgo 13499
mozilla-inbound-win32_gecko 7127
mozilla-inbound-win32_gecko-debug 8401
mozilla-inbound-win64 7302
mozilla-inbound-win64-debug 7076
mozilla-inbound-win64-pgo 15608
mozilla-inbound_win32-debug_spidermonkey-compacting 7718
mozilla-inbound_win32-debug_spidermonkey-plaindebug 4954
mozilla-inbound_win32_spidermonkey-plain 4482
try-comm-central-win32 5269
try-win32 4030
try-win32-debug 5602
try-win32-mulet 3939
try-win32_gecko 5619
try-win32_gecko-debug 7148
try-win64 4127
try-win64-debug 4809
try_win32-debug_spidermonkey-compacting 9028
try_win32_spidermonkey-compacting 7302
try_win64-debug_spidermonkey-compacting 8717
By build type, EC2 r2.xlarge (seconds):

try-comm-central-win32 5320
try-win32 6130
try-win32-debug 5606
try-win32-mulet 4170
try-win32_gecko 5849
try-win32_gecko-debug 6038
try-win64 5649
try-win64-debug 5970
try_win32-debug_spidermonkey-compacting 11369
try_win32-debug_spidermonkey-plaindebug 7491
try_win32_spidermonkey-compacting 5957
try_win32_spidermonkey-plain 3083
try_win64-debug_spidermonkey-compacting 6089
try_win64-debug_spidermonkey-plaindebug 7499
try_win64_spidermonkey-plain 5277
Relevant tests compared (seconds):

Build type                                EC2 R3.Xlarge  IX
try_win32_spidermonkey-compacting         5957           7302
try_win32-debug_spidermonkey-compacting   11369          9028
try_win64-debug_spidermonkey-compacting   6089           8717
try-comm-central-win32                    5320           5269
try-win32                                 6130           4030
try-win32_gecko                           5849           5619
try-win32_gecko-debug                     6038           7148
try-win32-debug                           5606           5602
try-win32-mulet                           4170           3939
try-win64                                 5649           4127
try-win64-debug                           5970           4809
That is screaming for a spreadsheet (and I corrected your r2 to r3 in comment 12, since that seems to be what the instance type actually is): https://docs.google.com/spreadsheets/d/1QQ2U13rmqo7OSTUmrFrwC_DvW31tScXX25yF_eRLIss When you get the numbers for r3.2xlarge, please put them in there so we can do some easy comparisons.
Some times were added for r3.2xl, but we probably need some more data. Also going to investigate r3.4xl and the c3 types again since our problems with them previously were network constraints that might have been mitigated by the network patches Q deployed.
Using the following script to slice and dice data: https://github.com/mozilla/buildapi-recent-stats-read
Flags: needinfo?(q)
C-class instance performance looks good so far. There was an error with subnets and signing, so I am having to redeploy my c3.xlarge instances.
No redeploy yet, as the focus on Puppet configs has taken precedence. I will get the C4.2xl numbers uploaded; however, because average build times are going up across the board, this data will be skewed.
Tables updated in the original spreadsheet. I will be stopping instances today to save cost.
Numbers were captured during a comparable period of performance for all types. I think we now have a good rough picture of speeds, and this can be re-evaluated in a repeatable way. Closing this bug; future comparisons should live in their own bugs.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED