Closed
Bug 935533
Opened 11 years ago
Closed 10 years ago
Prepare infra to handle spot instances
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Assigned: rail)
References
Details
Attachments
(6 files)
1.71 KB, patch | catlee: review+ | rail: checked-in+
131.11 KB, text/plain
2.23 KB, text/plain
438.27 KB, text/plain
1.03 KB, patch | catlee: review+ | rail: checked-in+
27.71 KB, patch | catlee: review+ | rail: checked-in+
Let's use tst-linux{32,64}-spot naming for them.
* add to slavealloc
* add to buildbot
* add to puppet
* create EC2 network interfaces
* add A, PTR and CNAME entries
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → rail
Assignee | ||
Comment 1•11 years ago
|
||
Attachment #828069 -
Flags: review?(catlee)
Updated•11 years ago
|
Attachment #828069 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 2•11 years ago
|
||
FTR, added to slavealloc.
Assignee | ||
Comment 3•11 years ago
|
||
Comment on attachment 828069 [details] [diff] [review]
buildbot-configs

https://hg.mozilla.org/build/buildbot-configs/rev/6f5b7a2a1c8e
Attachment #828069 -
Flags: checked-in+
Assignee | ||
Comment 4•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #0)
> * add to puppet

Not needed. I used a different domain for my staging slaves, which is why it failed to sync.
Assignee | ||
Comment 5•11 years ago
|
||
To have more available IPs in the designated IP range I added 4 /25 subnets in use1 (5{8,9}.{0,128}/25) and 1 59.0/24 subnet in usw2.
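As a rough check of how many addresses those subnets add, here is a minimal sketch using Python's `ipaddress` module. The comment elides the leading octets of the range, so the `10.0` prefix below is a hypothetical placeholder, not the real network. The sketch also assumes the usual AWS behaviour of reserving five addresses (the first four and the last) in every subnet.

```python
import ipaddress

# HYPOTHETICAL prefix: the real leading octets are not given in the bug.
PREFIX = "10.0"

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable(cidr):
    """Return the number of addresses left for instances in one subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# Expands 5{8,9}.{0,128}/25 into four /25 subnets for use1.
use1 = ["%s.%d.%d/25" % (PREFIX, third, fourth)
        for third in (58, 59) for fourth in (0, 128)]
# The single 59.0/24 subnet for usw2.
usw2 = ["%s.59.0/24" % PREFIX]

use1_total = sum(usable(c) for c in use1)
usw2_total = sum(usable(c) for c in usw2)
print(use1_total, usw2_total)
```

Splitting a /24 into two /25 subnets costs five extra reserved addresses, which is why the four /25 subnets yield slightly fewer than two /24 subnets would.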
Assignee | ||
Comment 6•11 years ago
|
||
I used this script to generate and tag EC2 network interfaces and generate invtool commands to add DNS entries (output incoming).
Assignee | ||
Comment 7•11 years ago
|
||
invtool commands (still running)
Assignee | ||
Comment 8•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #7)
> Created attachment 828206 [details]
> spots.sh
>
> invtool commands (still running)

done in 43:57 8-)
Comment 9•11 years ago
|
||
in production
Assignee | ||
Comment 10•11 years ago
|
||
Attachment #828827 -
Flags: review?(catlee)
Comment 11•11 years ago
|
||
Comment on attachment 828827 [details] [diff] [review]
aws_stop_idle.py changes

Review of attachment 828827 [details] [diff] [review]:
-----------------------------------------------------------------

I wonder if we should add another tag to identify these instead of relying on regexes....
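To make the trade-off in that review comment concrete, here is a small sketch contrasting the two ways of recognising a spot slave. The regex is only modelled on the `tst-linux{32,64}-spot` naming from this bug, and the `moz-lifecycle` tag key is purely hypothetical; neither is taken from the actual aws_stop_idle.py patch.

```python
import re

# Modelled on the tst-linux{32,64}-spot-NNN names used in this bug.
SPOT_NAME_RE = re.compile(r"^tst-linux(32|64)-spot-\d+$")


def is_spot_by_name(name):
    """Identify a spot slave from its hostname alone."""
    return bool(SPOT_NAME_RE.match(name))


def is_spot_by_tag(tags):
    """Identify a spot slave from an explicit tag.

    "moz-lifecycle" is a HYPOTHETICAL tag key used for illustration only.
    """
    return tags.get("moz-lifecycle") == "spot"


print(is_spot_by_name("tst-linux64-spot-415"))
print(is_spot_by_tag({"moz-lifecycle": "spot"}))
```

A tag keeps working if the naming scheme changes, while the regex needs no extra tagging step when instances are created.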
Attachment #828827 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 12•11 years ago
|
||
Some good progress here. I *manually* requested 4 instances yesterday, 2x$0.025 + 2x$0.035.

ATM, 2 of them (1x$0.025 + 1x$0.035) are still alive:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux64-spot-415
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux64-spot-351

and 2 were killed:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux64-spot-306
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux64-spot-465

Buildbot retried 2 interrupted jobs:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d8df595005dc&jobname=Ubuntu%20ASAN%20VM%2012.04%20x64%20mozilla-inbound%20opt%20test%20mochitest-5
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=125d0f9767b5&jobname=Ubuntu%20VM%2012.04%20x64%20mozilla-inbound%20debug%20test%20reftest
Assignee | ||
Comment 13•11 years ago
|
||
Comment on attachment 828827 [details] [diff] [review]
aws_stop_idle.py changes

https://hg.mozilla.org/build/cloud-tools/rev/89f8828d1670
Attachment #828827 -
Flags: checked-in+
Assignee | ||
Comment 14•11 years ago
|
||
First 32-bit slave worked fine: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux32-spot-396
Assignee | ||
Comment 15•11 years ago
|
||
linux64 in another region (us-east-1) looks fine: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tst-linux64-spot-173
Assignee | ||
Comment 16•11 years ago
|
||
works fine in --dryrun mode! :)
Attachment #830615 -
Flags: review?(catlee)
Assignee | ||
Comment 17•11 years ago
|
||
Comment on attachment 830615 [details] [diff] [review]
watch pending

Review of attachment 830615 [details] [diff] [review]:
-----------------------------------------------------------------

::: aws/configs/watch_pending.cfg
@@ +21,5 @@
> "^Ubuntu ASAN VM 12.04 x64.*": "tst-linux64"
> + },
> + "spot_limits": {
> + "us-east-1": {
> + "tst-linux64": {

These limits are different for servo, which is why they are here and not in the config/* files.
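The per-region, per-slave-type limits cap how many spot requests may be active at once. A minimal sketch of applying such a cap follows. The quoted hunk cuts off before the limit values, so the numbers and the flat integer layout of `SPOT_LIMITS` below are assumptions for illustration, and `spots_to_start` is a hypothetical helper rather than a function from cloud-tools.

```python
# HYPOTHETICAL values and layout: the real config nests further and its
# numbers are not shown in the bug.
SPOT_LIMITS = {
    "us-east-1": {"tst-linux64": 100},
    "us-west-2": {"tst-linux64": 50},
}


def spots_to_start(region, slave_type, active_count, start_count, started=0):
    """Return how many new spot requests may be made in one region.

    active_count: spot requests already active in the region
    start_count:  instances wanted in total for the pending jobs
    started:      instances already started elsewhere during this run
    """
    region_limit = SPOT_LIMITS.get(region, {}).get(slave_type, 0)
    can_be_started = region_limit - active_count
    if can_be_started < 1:
        # Limit reached (or no limit configured): start nothing here.
        return 0
    return max(0, min(can_be_started, start_count - started))


print(spots_to_start("us-east-1", "tst-linux64", active_count=95, start_count=10))
```

Keeping the limits in a per-deployment config file lets a second consumer, such as the servo setup mentioned above, carry different caps without touching shared configuration.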
Comment 18•11 years ago
|
||
Comment on attachment 830615 [details] [diff] [review]
watch pending

Review of attachment 830615 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Just a few nits. I'm worried that this will break servo, so make sure that's tested first.

::: aws/aws_watch_pending.py
@@ +33,3 @@
> query = sa.text("""
> SELECT buildername, count(*) FROM
> + buildrequests, builds WHERE

Do you need to join on builds here? At the least, we need to add a condition to the WHERE clause to prevent joining all of buildrequests to all of builds.

@@ +279,5 @@
> + moz_instance_type=moz_instance_type))
> + can_be_started = region_limit - active_count
> + if can_be_started < 1:
> + log.debug("Active spot request count in %s region hit limit of %s."
> + " Active count: %s", region, region_limit, active_count)

Can you make this message explicit that it's not starting any spot instances?

@@ +282,5 @@
> + log.debug("Active spot request count in %s region hit limit of %s."
> + " Active count: %s", region, region_limit, active_count)
> + continue
> +
> + to_be_started = min([can_be_started, start_count - started])

No need for a list here, I don't think?

@@ +326,5 @@
> +
> + spec = NetworkInterfaceSpecification(
> + network_interface_id=interface.id)
> + nc = NetworkInterfaceCollection(spec)
> + user_data = 'FQDN="%s"' % fqdn

The instance can't get this from DNS?
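The concern about listing two tables in FROM without a join condition can be demonstrated with an in-memory SQLite database. This is only a sketch: the two-column schema is heavily simplified, and the `brid` column linking builds to build requests is an assumption about the scheduler database, not something stated in the bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, ASSUMED schema: builds.brid references buildrequests.id.
conn.executescript("""
    CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, buildername TEXT);
    CREATE TABLE builds (id INTEGER PRIMARY KEY, brid INTEGER);
    INSERT INTO buildrequests VALUES (1, 'a'), (2, 'a'), (3, 'b');
    INSERT INTO builds VALUES (10, 1), (11, 3);
""")

# Without a join condition every request is paired with every build.
cross = conn.execute(
    "SELECT count(*) FROM buildrequests, builds").fetchone()[0]

# With the condition only matching pairs are counted.
joined = conn.execute(
    "SELECT count(*) FROM buildrequests, builds "
    "WHERE builds.brid = buildrequests.id").fetchone()[0]

per_builder = conn.execute(
    "SELECT buildername, count(*) FROM buildrequests, builds "
    "WHERE builds.brid = buildrequests.id "
    "GROUP BY buildername ORDER BY buildername").fetchall()

print(cross, joined, per_builder)
```

With three requests and two builds the unconstrained query counts six rows, so per-builder counts taken from it would be inflated by the size of the builds table.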
Attachment #830615 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 19•11 years ago
|
||
Comment on attachment 830615 [details] [diff] [review]
watch pending

https://hg.mozilla.org/build/cloud-tools/rev/d09b772e5067
Attachment #830615 -
Flags: checked-in+
Assignee | ||
Comment 20•10 years ago
|
||
Status update:

We are almost ready to go and gradually increase the number of spot instances for test slaves. There is one unknown metric we need to track: how often Amazon kills our instances (read: how often we lose our jobs). Amazon's spot logs don't have enough information to measure this.

I added a task to the spot sanity check cronjob to poll spot requests and log their state and status whenever they change, so we can get a ratio of how often we get "instance-terminated-by-price".

http://hg.mozilla.org/build/cloud-tools/file/3850b33833a2/aws/spot_sanity_check.py#l105

I'm going to analyze the results on Monday and, depending on that, proceed with the plan to gradually increase the number of spot instances.
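The idea of logging a spot request only when its state or status changes can be sketched without any AWS calls. The snapshot format (tuples of request id, state, status code) and both helper names are hypothetical; the linked spot_sanity_check.py is the actual implementation and may differ.

```python
def record_changes(previous, snapshot, log):
    """Append to log every request whose (state, status) pair changed.

    previous: dict mapping request id -> last seen (state, status)
    snapshot: iterable of (request_id, state, status) from one poll
    log:      list collecting the observed transitions
    """
    for req_id, state, status in snapshot:
        current = (state, status)
        if previous.get(req_id) != current:
            log.append((req_id, state, status))
            previous[req_id] = current


def count_status(log, status):
    """Count logged transitions into the given status code."""
    return sum(1 for _, _, s in log if s == status)


previous, log = {}, []
# Three polls; the request ids are made up.
record_changes(previous, [("sir-1", "open", "pending-evaluation"),
                          ("sir-2", "active", "fulfilled")], log)
record_changes(previous, [("sir-1", "active", "fulfilled"),
                          ("sir-2", "active", "fulfilled")], log)
record_changes(previous, [("sir-1", "closed", "instance-terminated-by-price"),
                          ("sir-2", "active", "fulfilled")], log)

killed = count_status(log, "instance-terminated-by-price")
fulfilled = count_status(log, "fulfilled")
print(killed, fulfilled)
```

Dividing the terminated-by-price count by the fulfilled count gives the kind of kill ratio the comment is after.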
Assignee | ||
Comment 21•10 years ago
|
||
As a rough way to measure the impact of spot instances we can use the build JSON files from http://builddata.pub.build.mozilla.org/buildjson/ and look for result == 5 (RETRY). Some figures for December:

date,spot retries,spot jobs,spot+tst-linux jobs
2013-12-01,3,594,1690
2013-12-02,32,1070,9804
2013-12-03,69,1073,14119
2013-12-04,37,1112,11774
2013-12-05,106,914,15686
2013-12-06,134,2477,14270
2013-12-07,24,1989,9163
2013-12-08,19,1961,4540
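A minimal sketch of that counting, run on already-parsed build records rather than the real files. The `slavename` and `result` keys are assumptions about the record layout, and the on-demand host name in the sample is invented; only result == 5 (RETRY) and the tst-linux{32,64}-spot naming come from this bug.

```python
import re

RETRY = 5  # Buildbot result code for RETRY, as stated above.
SPOT_RE = re.compile(r"^tst-linux(32|64)-spot-")
LINUX_TEST_RE = re.compile(r"^tst-linux(32|64)-")


def summarize(builds):
    """Return (spot retries, spot jobs, spot + other tst-linux jobs)."""
    spot_retries = spot_jobs = linux_jobs = 0
    for build in builds:
        name = build.get("slavename") or ""
        if not LINUX_TEST_RE.match(name):
            continue  # Not a Linux test slave at all.
        linux_jobs += 1
        if SPOT_RE.match(name):
            spot_jobs += 1
            if build.get("result") == RETRY:
                spot_retries += 1
    return spot_retries, spot_jobs, linux_jobs


# Invented sample records for illustration.
sample = [
    {"slavename": "tst-linux64-spot-415", "result": 0},
    {"slavename": "tst-linux64-spot-306", "result": 5},
    {"slavename": "tst-linux32-spot-396", "result": 0},
    {"slavename": "tst-linux64-ondemand-001", "result": 5},
    {"slavename": "bld-linux64-001", "result": 0},
]
print(summarize(sample))
```

As noted later in the thread, a RETRY is not proof of a killed instance, so these counts are an upper bound on interruptions.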
Assignee | ||
Comment 22•10 years ago
|
||
date,spot retries,spot jobs,spot+tst-linux jobs
2013-12-09,77,2712,13424
Assignee | ||
Comment 23•10 years ago
|
||
date,spot retries,spot jobs,spot+tst-linux jobs
2013-12-10,179,4474,16745

I bumped the limits twice: https://hg.mozilla.org/build/cloud-tools/rev/be3d37160cc0
Assignee | ||
Comment 24•10 years ago
|
||
Some stats since Friday:
Total requests: 7968
Fulfilled: 3031
Not Fulfilled: 4622
Terminated by price: 964
Terminated by user: 2015
Assignee | ||
Comment 25•10 years ago
|
||
This is for Wed, Dec 11:
total ubuntu jobs, jobs on spots, spot retries
15068, 4645 (30%), 279 (6%)
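For reading these rows: the first percentage appears to be spot jobs as a share of all jobs, and the second appears to be retries as a share of spot jobs, both truncated to whole numbers. That is an inference from the figures in this thread, not something the comments state. A sketch that reproduces the row format under that assumption (`pct` and `format_row` are hypothetical helpers):

```python
def pct(part, whole):
    """Integer percentage, truncated rather than rounded."""
    return 100 * part // whole if whole else 0


def format_row(total, spot_jobs, spot_retries):
    """Format one stats row: spot share of total, retry share of spot jobs."""
    return "%d, %d (%d%%), %d (%d%%)" % (
        total, spot_jobs, pct(spot_jobs, total),
        spot_retries, pct(spot_retries, spot_jobs))


print(format_row(15068, 4645, 279))
# → 15068, 4645 (30%), 279 (6%)
```

Under truncation a reported 0% means "below 1%", not "no retries", which matters for the quieter days in the daily tables.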
Comment 26•10 years ago
|
||
Something here is in production.
Assignee | ||
Comment 27•10 years ago
|
||
Thu, Dec 12:
total ubuntu jobs, jobs on spots, spot retries
15750, 6220 (39%), 689 (11%)

Looks like more jobs are getting interrupted. As a possible solution we can bump the price and see how it affects the figures.
Assignee | ||
Comment 28•10 years ago
|
||
Fri, Dec 13:
total ubuntu jobs, jobs on spots, spot retries
15019, 6079 (40%), 586 (9%)
Comment 29•10 years ago
|
||
Plus the (relatively small) content of bug 851431, since we don't RETRY on that message, so we manually retrigger (or not).
Assignee | ||
Comment 30•10 years ago
|
||
Sat, Dec 14:
total ubuntu jobs, jobs on spots, spot retries
5742, 4320 (75%), 44 (1%)
Assignee | ||
Comment 31•10 years ago
|
||
Sun, Dec 15:
total ubuntu jobs, jobs on spots, spot retries
2974, 2821 (94%), 36 (1%)
Assignee | ||
Comment 32•10 years ago
|
||
Mon, Dec 16:
total jobs, jobs on spots, spot retries
10837, 7361 (67%), 200 (2%)
Assignee | ||
Comment 33•10 years ago
|
||
Tue, Dec 17:
total jobs, jobs on spots, spot retries
15938, 9301 (58%), 533 (5%)
Assignee | ||
Comment 34•10 years ago
|
||
Wed (18) and Thu (19):
total jobs, jobs on spots, spot retries
16289, 9239 (56%), 414 (4%)
14652, 10711 (73%), 360 (3%)

Getting better!
Assignee | ||
Comment 35•10 years ago
|
||
date, total jobs, jobs on spots, spot retries
2013-12-20, 12912, 9104 (70%), 482 (5%)
2013-12-21, 5781, 5248 (90%), 35 (0%)
2013-12-22, 2395, 2362 (98%), 16 (0%)
2013-12-23, 4439, 4381 (98%), 16 (0%)
2013-12-24, 2586, 2482 (95%), 25 (1%)
2013-12-25, 1983, 1929 (97%), 14 (0%)
2013-12-26, 1326, 1188 (89%), 44 (3%)
2013-12-27, 1583, 1534 (96%), 14 (0%)
2013-12-28, 2217, 2186 (98%), 17 (0%)
2013-12-29, 1576, 1553 (98%), 12 (0%)
2013-12-30, 2566, 2449 (95%), 47 (1%)
2013-12-31, 2827, 2733 (96%), 41 (1%)
Assignee | ||
Comment 36•10 years ago
|
||
FTR, after talking to Taras regarding the AWS spot pricing model, we decided to bump the bid prices up to 10¢ for m1.medium (vs the 12¢ on-demand price). This price represents the highest price we are willing to pay in case the "market" price goes up. This should improve our retry stats. I adjusted the prices on Friday, Jan 3rd, PT afternoon.

date, total jobs, jobs on spots, spot retries
2014-01-01, 1374, 1365 (99%), 6 (0%)
2014-01-02, 8104, 7536 (92%), 63 (0%)
2014-01-03, 9946, 9172 (92%), 75 (0%)
2014-01-04, 4776, 4509 (94%), 20 (0%)
2014-01-05, 1460, 1409 (96%), 20 (1%)

The stats are still not representative due to low load.
Assignee | ||
Comment 37•10 years ago
|
||
Updated stats for m1.medium (test slaves):

tst-linux(32|64)
date, total jobs, jobs on spots, spot retries, o-d retries
2014-01-01, 1374, 1365 (99%), 6 (0%), 0 (0%)
2014-01-02, 8104, 7536 (92%), 63 (0%), 4 (0%)
2014-01-03, 9946, 9172 (92%), 75 (0%), 29 (3%)
2014-01-04, 4776, 4509 (94%), 20 (0%), 0 (0%)
2014-01-05, 1460, 1409 (96%), 20 (1%), 0 (0%)
2014-01-06, 8309, 6939 (83%), 231 (3%), 6 (0%)
2014-01-07, 14010, 11138 (79%), 468 (4%), 1 (0%)
2014-01-08, 15808, 12015 (76%), 606 (5%), 1 (0%)
2014-01-09, 14838, 12048 (81%), 450 (3%), 1 (0%)
2014-01-10, 14875, 12926 (86%), 297 (2%), 0 (0%)
2014-01-11, 6394, 6174 (96%), 25 (0%), 0 (0%)
2014-01-12, 4076, 4018 (98%), 30 (0%), 0 (0%)

I'm going to increase the limits (requires some DNS, etc. work).
Assignee | ||
Comment 38•10 years ago
|
||
FTR, these are our current kill ratios. They have changed drastically since we started. I "blame" the bidding library added in bug 972562 and the spot market health checks from bug 978971 (we don't try to request new spot instances if the market price is higher than 80% of our maximum price, and we don't use availability zones where more than 10% of our spot requests in the last 15 minutes were not fulfilled because of price or capacity issues).

^bld-linux64 (builders)
date, total jobs, jobs on spots, spot retries, o-d retries
2014-03-01, 1725, 1356 (78%), 2 (0%), 2 (0%)
2014-03-02, 1036, 762 (73%), 1 (0%), 0 (0%)
2014-03-03, 2564, 2046 (79%), 68 (3%), 0 (0%)
2014-03-04, 3263, 2636 (80%), 27 (1%), 1 (0%)
2014-03-05, 2987, 2306 (77%), 38 (1%), 2 (0%)
2014-03-06, 3456, 2688 (77%), 29 (1%), 1 (0%)
2014-03-07, 3003, 2425 (80%), 10 (0%), 1 (0%)
2014-03-08, 1303, 951 (72%), 0 (0%), 0 (0%)
2014-03-09, 998, 685 (68%), 0 (0%), 0 (0%)
2014-03-10, 2282, 1966 (86%), 15 (0%), 0 (0%)
2014-03-11, 2730, 2385 (87%), 2 (0%), 0 (0%)
2014-03-12, 2883, 2616 (90%), 9 (0%), 0 (0%)
2014-03-13, 3109, 2728 (87%), 3 (0%), 0 (0%)

^try-linux64 (try builders)
date, total jobs, jobs on spots, spot retries, o-d retries
2014-03-01, 285, 285 (100%), 0 (0%), 0 (0%)
2014-03-02, 180, 180 (100%), 0 (0%), 0 (0%)
2014-03-03, 696, 696 (100%), 17 (2%), 0 (0%)
2014-03-04, 821, 821 (100%), 11 (1%), 0 (0%)
2014-03-05, 1246, 1246 (100%), 0 (0%), 0 (0%)
2014-03-06, 1122, 1122 (100%), 1 (0%), 0 (0%)
2014-03-07, 1170, 1170 (100%), 0 (0%), 0 (0%)
2014-03-08, 468, 468 (100%), 0 (0%), 0 (0%)
2014-03-09, 149, 149 (100%), 0 (0%), 0 (0%)
2014-03-10, 805, 797 (99%), 0 (0%), 0 (0%)
2014-03-11, 1043, 1023 (98%), 0 (0%), 0 (0%)
2014-03-12, 1502, 1502 (100%), 2 (0%), 0 (0%)
2014-03-13, 1781, 1781 (100%), 4 (0%), 0 (0%)

^tst-linux(32|64) (test machines)
date, total jobs, jobs on spots, spot retries, o-d retries
2014-03-01, 9176, 8742 (95%), 1 (0%), 0 (0%)
2014-03-02, 6003, 5994 (99%), 4 (0%), 0 (0%)
2014-03-03, 13668, 11304 (82%), 6 (0%), 1 (0%)
2014-03-04, 15757, 14275 (90%), 11 (0%), 8 (0%)
2014-03-05, 16630, 14979 (90%), 36 (0%), 15 (0%)
2014-03-06, 17547, 16748 (95%), 17 (0%), 9 (1%)
2014-03-07, 16331, 16016 (98%), 8 (0%), 2 (0%)
2014-03-08, 7125, 6983 (98%), 1 (0%), 0 (0%)
2014-03-09, 3602, 3599 (99%), 2 (0%), 0 (0%)
2014-03-10, 12504, 11865 (94%), 36 (0%), 1 (0%)
2014-03-11, 14050, 13958 (99%), 19 (0%), 0 (0%)
2014-03-12, 16662, 15367 (92%), 23 (0%), 8 (0%)
2014-03-13, 20173, 17361 (86%), 92 (0%), 11 (0%)
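The two market health checks described in this comment can be sketched as simple predicates. Only the thresholds (80% of the maximum price, 10% unfulfilled requests in the last 15 minutes) come from the comment; the function names, the data layout and the sample zones are hypothetical, and the real checks live in the patches for bug 978971.

```python
def price_ok(market_price, max_price, threshold=0.8):
    """True if the market price is at most 80% of our maximum bid."""
    return market_price <= threshold * max_price


def zone_ok(recent_requests, max_unfulfilled_ratio=0.1):
    """True if at most 10% of recent spot requests went unfulfilled.

    recent_requests: list of booleans, one per request made in the last
    15 minutes, True when the request was fulfilled.
    """
    if not recent_requests:
        return True  # No recent history, nothing to hold against the zone.
    unfulfilled = sum(1 for fulfilled in recent_requests if not fulfilled)
    return unfulfilled <= max_unfulfilled_ratio * len(recent_requests)


def usable_zones(zones, max_price):
    """Return names of availability zones that pass both checks."""
    return [z["name"] for z in zones
            if price_ok(z["market_price"], max_price) and zone_ok(z["recent"])]


# Invented sample data; 0.10 is the 10 cent maximum bid mentioned earlier.
zones = [
    {"name": "zone-a", "market_price": 0.03, "recent": [True] * 10},
    {"name": "zone-b", "market_price": 0.09, "recent": [True] * 10},
    {"name": "zone-c", "market_price": 0.03, "recent": [True] * 8 + [False] * 2},
]
print(usable_zones(zones, max_price=0.10))
```

Both checks avoid bidding into a market that is likely to reclaim the instance soon, which is one plausible reason the retry percentages in the tables above dropped.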
Comment 39•10 years ago
|
||
do the testers take longer than builders, or are the retries due to other issues?
Assignee | ||
Comment 40•10 years ago
|
||
The testers usually take much less time. The very slightly higher ratio may be explained by their cheaper price (more people bid for them). BTW, the retry status here is not a 100% indicator of a killed instance, but I'd say it's a good indicator.
Assignee | ||
Comment 41•10 years ago
|
||
We are ROCK SOLID now! :)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Component: General Automation → General