Closed
Bug 974869
Opened 11 years ago
Closed 11 years ago
rewrite watch pending to cope better with spot requests that aren't being fulfilled
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: rail)
Details
It seems the integration trees are having backlogs of jobs again (according to slavealloc, about 250 pending build jobs).
Closing the integration trees to give them a chance to catch up. Ben is also looking into it.
Reporter
Updated•11 years ago
Severity: normal → blocker
Reporter
Comment 1•11 years ago
Could this be related to the AWS maintenance mail we got earlier today?
Comment 2•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #1)
> could this be related to this AWS maintenance mail we got earlier today ?
No, it's not.
When I looked at the status of our AWS machines, all of the bld-linux64 machines in use1 were started. Many in usw2 were started as well. I had a look at the watch_pending log, and I need a bit of help interpreting it (catlee or rail?):
2014-02-20 03:40:48,952 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:46:45,502 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:50:57,806 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:58:44,783 - bld-linux64 - started 34 spot instances; need 4
2014-02-20 03:59:00,383 - bld-linux64 - started 4 instances; need 0
2014-02-20 04:00:26,010 - bld-linux64 - started 2 spot instances; need 46
2014-02-20 04:01:11,061 - bld-linux64 - started 46 instances; need 0
2014-02-20 04:05:22,519 - bld-linux64 - started 2 spot instances; need 86
2014-02-20 04:06:39,757 - bld-linux64 - started 86 instances; need 0
2014-02-20 04:12:02,879 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:12:56,123 - bld-linux64 - started 89 instances; need 0
2014-02-20 04:16:47,450 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:17:36,458 - bld-linux64 - started 88 instances; need 1
2014-02-20 04:20:12,245 - bld-linux64 - started 0 spot instances; need 106
2014-02-20 04:20:34,520 - bld-linux64 - started 0 instances; need 106
Our pending job timeline looks roughly like this (I'm counting "linux" and "linux64" on the first graph):
~3:30am: ~140
~3:50am: ~80
~4:00am: ~200
~4:20am: 0
When I first logged into the AWS console there were many (at least 50) on-demand instances not started. I tried to start some and was told that there was insufficient capacity. Then the screen refreshed itself, and most of them were online (I guess I lost a race with watch_pending?). I started the remaining few offline on-demand instances, as we were at ~200 pending.
It almost seems like the on-demand instances were slow to start because we were waiting for spot instances to try to start, but I don't know that for sure.
At this moment, everything looks fine again.
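For context, the paired lines in the log above read like a two-pass loop: a spot pass first, then an on-demand pass that covers whatever spot left unfilled. A minimal sketch of that shape (the helper names here are hypothetical, not the real cloud-tools functions):

import logging

log = logging.getLogger("watch_pending")

def start_instances(slave_type, needed, request_spot, start_ondemand):
    # Pass 1: place spot requests for as much of the demand as possible.
    # A spot *request* is counted as "started" here even if EC2 never
    # fulfills it; an unfulfilled request only shows up again as renewed
    # demand on a later poll, which would explain the slow recovery.
    spot = request_spot(slave_type, needed)
    log.info("%s - started %i spot instances; need %i",
             slave_type, spot, max(needed - spot, 0))
    # Pass 2: cover the shortfall with on-demand instances.
    remaining = max(needed - spot, 0)
    if remaining:
        ondemand = start_ondemand(slave_type, remaining)
        log.info("%s - started %i instances; need %i",
                 slave_type, ondemand, max(remaining - ondemand, 0))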
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Reporter
Comment 3•11 years ago
Yeah, confirmed - seems we caught up. Trees reopened at 4:30 PST.
Updated•11 years ago
Summary: Integration trees closed -> Linux builders falling behind - AWS issue? → took longer than expected to start instances for bld-linux64 jobs
Comment 4•11 years ago
08:03 <@catlee> Tomcat|Sheriffduty, bhearsum: I think it's due to a problem in how we are doing spot requests
08:03 < bhearsum> ah
08:03 < bhearsum> that lines up with how relatively quickly it recovered on its own
08:04 <@catlee> we're slow to notice that spot requests aren't being fulfilled
08:08 < bhearsum> catlee: do we have any idea how to fix that, or are we just going to live with it for the foreseeable future?
08:08 < bhearsum> and how long does it take to notice?
08:09 <@catlee> bhearsum: the fix is to rewrite watch_pending
08:09 <@catlee> it's possible to cope with
08:09 <@catlee> it's somewhere on rail's list, unless someone else steps up
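One way to notice unfulfilled spot requests sooner would be to poll the open requests and cancel anything older than a grace period, so the on-demand fallback can take over. A sketch of that idea, using boto3-style calls purely for illustration (the grace period and the cancel-then-fall-back policy are assumptions, not the planned rewrite):

import datetime
import boto3

GRACE = datetime.timedelta(minutes=10)  # made-up threshold

def cancel_stale_spot_requests(region):
    """Cancel spot requests still unfulfilled after GRACE; return count."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "state", "Values": ["open"]}])
    now = datetime.datetime.now(datetime.timezone.utc)
    stale = [r["SpotInstanceRequestId"]
             for r in resp["SpotInstanceRequests"]
             if now - r["CreateTime"] > GRACE]
    if stale:
        # Cancelled demand shows up as pending again on the next poll,
        # where it can be covered by on-demand instances instead.
        ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=stale)
    return len(stale)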
Assignee: nobody → rail
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Summary: took longer than expected to start instances for bld-linux64 jobs → rewrite watch pending to cope better with spot requests that aren't being fulfilled
Assignee
Comment 5•11 years ago
I'm going to integrate a better spot bidding algorithm this Friday, which should help with this situation.
The bidding library is here: https://github.com/tarasglek/spotbidagent/blob/master/bid.py
and the related patch for our tools is here: https://github.com/rail/build-cloud-tools/compare/spot_bid?expand=1
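The general shape of a bidding helper like this is to look at recent spot price history per availability zone and only bid in zones where the observed price stays under a cap. An illustrative sketch along those lines (the cap, lookback window, and zone-picking policy are assumptions, not necessarily what bid.py actually does):

import datetime
import boto3

def choose_zone(region, instance_type, bid_cap, lookback_hours=3):
    """Return the cheapest availability zone whose recent spot price
    stayed under bid_cap, or None if no zone qualifies."""
    ec2 = boto3.client("ec2", region_name=region)
    start = (datetime.datetime.now(datetime.timezone.utc)
             - datetime.timedelta(hours=lookback_hours))
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start)["SpotPriceHistory"]
    worst = {}  # zone -> highest price seen in the window
    for point in history:
        price = float(point["SpotPrice"])
        zone = point["AvailabilityZone"]
        worst[zone] = max(worst.get(zone, 0.0), price)
    candidates = [(price, zone) for zone, price in worst.items()
                  if price < bid_cap]
    return min(candidates)[1] if candidates else None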
Assignee
Comment 6•11 years ago
This is in production now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Updated•7 years ago
Component: General Automation → General