Closed Bug 974869 Opened 11 years ago Closed 11 years ago

rewrite watch pending to cope better with spot requests that aren't being fulfilled

Categories

(Release Engineering :: General, defect)

x86
Linux
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: rail)

Details

It seems the integration trees are backlogged with jobs again (about 250 build jobs, according to slave alloc). Closing the integration trees to give them a chance to catch up. Ben is also looking into it.
Severity: normal → blocker
could this be related to this AWS maintenance mail we got earlier today ?
(In reply to Carsten Book [:Tomcat] from comment #1)
> could this be related to this AWS maintenance mail we got earlier today ?

No, it's not. When I looked at the status of our AWS machines, all of the bld-linux64 machines in use1 were started. Many in usw2 were started as well.

I had a look at the watch pending log, and I need a bit of help interpreting it (catlee or rail?):

2014-02-20 03:40:48,952 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:46:45,502 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:50:57,806 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:58:44,783 - bld-linux64 - started 34 spot instances; need 4
2014-02-20 03:59:00,383 - bld-linux64 - started 4 instances; need 0
2014-02-20 04:00:26,010 - bld-linux64 - started 2 spot instances; need 46
2014-02-20 04:01:11,061 - bld-linux64 - started 46 instances; need 0
2014-02-20 04:05:22,519 - bld-linux64 - started 2 spot instances; need 86
2014-02-20 04:06:39,757 - bld-linux64 - started 86 instances; need 0
2014-02-20 04:12:02,879 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:12:56,123 - bld-linux64 - started 89 instances; need 0
2014-02-20 04:16:47,450 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:17:36,458 - bld-linux64 - started 88 instances; need 1
2014-02-20 04:20:12,245 - bld-linux64 - started 0 spot instances; need 106
2014-02-20 04:20:34,520 - bld-linux64 - started 0 instances; need 106

Our pending job timeline looks roughly like this (I'm counting "linux" and "linux64" on the first graph):

~3:30am: ~140
~3:50am: ~80
~4:00am: ~200
~4:20am: 0

When I first logged into the AWS console there were many (at least 50) on-demand instances not started. I tried to start some and was told that there was insufficient capacity. Then the screen refreshed itself, and most of them were online (I guess I lost a race with watch pending?). I started the remaining few offline on-demand instances, as we were at ~200 pending.
It almost seems like the on demands were slow to start because we were waiting for spot instances to try to start - but I don't know that for sure. At this moment, everything looks fine again.
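To help read the watch pending log quoted in this comment, here is a minimal Python sketch that parses lines of that shape and lists the spot passes that left demand unmet. It is not part of watch pending itself. The names `LINE_RE`, `parse_line` and `unmet_spot_demand` are hypothetical, and treating "started N instances" (without "spot") as an on-demand pass is an assumption based on the comment above, not something the log states.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2014-02-20 04:00:26,010 - bld-linux64 - started 2 spot instances; need 46
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - (?P<pool>\S+) - "
    r"started (?P<started>\d+) (?P<kind>spot instances|instances); "
    r"need (?P<need>\d+)"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "pool": m.group("pool"),
        # Assumption: lines without "spot" describe on-demand instances.
        "kind": "spot" if m.group("kind").startswith("spot") else "ondemand",
        "started": int(m.group("started")),
        "need": int(m.group("need")),
    }


def unmet_spot_demand(lines):
    """Return (time, need) for every spot pass that still needed instances."""
    result = []
    for line in lines:
        rec = parse_line(line)
        if rec and rec["kind"] == "spot" and rec["need"] > 0:
            result.append((rec["time"], rec["need"]))
    return result
```

Run against the log above, this would flag the spot passes from 03:58 onward, which is where the "need" value after the spot attempt starts climbing.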
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Yeah, confirmed: it seems we caught up. Trees reopened at 4:30 PST.
Summary: Integration Trees closed -> Linux Builder falling behind - aws issue ? → took longer than expected to start instances for bld-linux64 jobs
08:03 <@catlee> Tomcat|Sheriffduty, bhearsum: I think it's due to a problem in how we are doing spot requests
08:03 < bhearsum> ah
08:03 < bhearsum> that lines up with how relatively quickly it recovered on its own
08:04 <@catlee> we're slow to notice that spot requests aren't being fulfilled
08:08 < bhearsum> catlee: do we have any idea how to fix that, or are we just going to live with it for the forseeable future?
08:08 < bhearsum> and how long does it take to notice?
08:09 <@catlee> bhearsum: the fix is to rewrite watch_pending
08:09 <@catlee> it's possible to cope with
08:09 <@catlee> it's somewhere on rail's list, unless someone else steps up
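One way to read "slow to notice that spot requests aren't being fulfilled" is that open spot requests keep counting as capacity for too long. The Python sketch below illustrates that idea only; it is not the watch_pending code or the eventual fix. `SpotRequest`, `split_stale_requests`, `on_demand_needed` and the 300-second timeout are all hypothetical, and no AWS API is called.

```python
from dataclasses import dataclass


@dataclass
class SpotRequest:
    request_id: str
    submitted_at: float  # seconds since the epoch
    fulfilled: bool = False


def split_stale_requests(requests, now, timeout_s=300.0):
    """Partition unfulfilled requests into (still_waiting, stale).

    A request is stale once it has been open for at least timeout_s
    seconds. Fulfilled requests are skipped, since their instances are
    assumed to be counted as running already.
    """
    waiting, stale = [], []
    for req in requests:
        if req.fulfilled:
            continue
        if now - req.submitted_at >= timeout_s:
            stale.append(req)
        else:
            waiting.append(req)
    return waiting, stale


def on_demand_needed(pending_jobs, running, requests, now, timeout_s=300.0):
    """Return how many on-demand instances to start right now.

    Only running instances and spot requests that are still within the
    timeout count as capacity; stale requests no longer do.
    """
    waiting, _stale = split_stale_requests(requests, now, timeout_s)
    covered = running + len(waiting)
    return max(0, pending_jobs - covered)
```

With a policy like this, a caller would presumably also cancel the stale requests so they cannot be fulfilled later and overshoot the pool. That step is left out because it depends on the cloud API in use.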
Assignee: nobody → rail
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Summary: took longer than expected to start instances for bld-linux64 jobs → rewrite watch pending to cope better with spot requests that aren't being fulfilled
I'm going to integrate a better spot bidding algorithm this Friday, which should help with this situation. The bidding library is here: https://github.com/tarasglek/spotbidagent/blob/master/bid.py and the related patch for our tools is here: https://github.com/rail/build-cloud-tools/compare/spot_bid?expand=1
This is in production now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Component: General Automation → General