Closed
Bug 974869
Opened 11 years ago
Closed 11 years ago
rewrite watch pending to cope better with spot requests that aren't being fulfilled
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: rail)
Details
It seems the integration trees are having backlogs of jobs again (according to slavealloc, about 250 pending build jobs).
Closing the integration trees to give them a chance to catch up. Ben is also looking into it.
Reporter
Updated•11 years ago
Severity: normal → blocker
Reporter
Comment 1•11 years ago
Could this be related to the AWS maintenance mail we got earlier today?
Comment 2•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #1)
> could this be related to this AWS maintenance mail we got earlier today ?
No, it's not.
When I looked at the status of our AWS machines, all of the bld-linux64 machines in use1 were started. Many in usw2 were started as well. I had a look at the watch_pending log, and I need a bit of help interpreting it (catlee or rail?):
2014-02-20 03:40:48,952 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:46:45,502 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:50:57,806 - bld-linux64 - started 18 spot instances; need 0
2014-02-20 03:58:44,783 - bld-linux64 - started 34 spot instances; need 4
2014-02-20 03:59:00,383 - bld-linux64 - started 4 instances; need 0
2014-02-20 04:00:26,010 - bld-linux64 - started 2 spot instances; need 46
2014-02-20 04:01:11,061 - bld-linux64 - started 46 instances; need 0
2014-02-20 04:05:22,519 - bld-linux64 - started 2 spot instances; need 86
2014-02-20 04:06:39,757 - bld-linux64 - started 86 instances; need 0
2014-02-20 04:12:02,879 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:12:56,123 - bld-linux64 - started 89 instances; need 0
2014-02-20 04:16:47,450 - bld-linux64 - started 17 spot instances; need 89
2014-02-20 04:17:36,458 - bld-linux64 - started 88 instances; need 1
2014-02-20 04:20:12,245 - bld-linux64 - started 0 spot instances; need 106
2014-02-20 04:20:34,520 - bld-linux64 - started 0 instances; need 106
Our pending job timeline looks roughly like this (I'm counting "linux" and "linux64" on the first graph):
~3:30am: ~140
~3:50am: ~80
~4:00am: ~200
~4:20am: 0
When I first logged into the AWS console there were many (at least 50) on-demand instances not started. I tried to start some and was told that there was insufficient capacity. Then the screen refreshed itself, and most of them were online (I guess I lost a race with watch_pending?). I started the remaining few offline on-demand instances, as we were at ~200 pending.
It almost seems like the on-demand instances were slow to start because we were waiting for spot instances to try to start, but I don't know that for sure.
At this moment, everything looks fine again.
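For context, the paired lines in the log above read like a two-pass loop: a spot pass first, then an on-demand pass that covers whatever spot left unfilled. A minimal sketch of that shape (the helper names here are hypothetical, not the real cloud-tools functions):

import logging

log = logging.getLogger("watch_pending")

def start_instances(slave_type, needed, request_spot, start_ondemand):
    # Pass 1: place spot requests for as much of the demand as possible.
    # A spot *request* is counted as "started" here even if EC2 never
    # fulfills it; an unfulfilled request only shows up again as renewed
    # demand on a later poll, which would explain the slow recovery.
    spot = request_spot(slave_type, needed)
    log.info("%s - started %i spot instances; need %i",
             slave_type, spot, max(needed - spot, 0))
    # Pass 2: cover the shortfall with on-demand instances.
    remaining = max(needed - spot, 0)
    if remaining:
        ondemand = start_ondemand(slave_type, remaining)
        log.info("%s - started %i instances; need %i",
                 slave_type, ondemand, max(remaining - ondemand, 0))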
Severity: blocker → major
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Reporter
Comment 3•11 years ago
Yeah, confirmed - seems we caught up. Trees reopened at 4:30 PST.
Updated•11 years ago
Summary: Integration trees closed -> Linux builders falling behind - AWS issue? → took longer than expected to start instances for bld-linux64 jobs
Comment 4•11 years ago
08:03 <@catlee> Tomcat|Sheriffduty, bhearsum: I think it's due to a problem in how we are doing spot requests
08:03 < bhearsum> ah
08:03 < bhearsum> that lines up with how relatively quickly it recovered on its own
08:04 <@catlee> we're slow to notice that spot requests aren't being fulfilled
08:08 < bhearsum> catlee: do we have any idea how to fix that, or are we just going to live with it for the foreseeable future?
08:08 < bhearsum> and how long does it take to notice?
08:09 <@catlee> bhearsum: the fix is to rewrite watch_pending
08:09 <@catlee> it's possible to cope with
08:09 <@catlee> it's somewhere on rail's list, unless someone else steps up
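One way to notice unfulfilled spot requests sooner would be to poll the open requests and cancel anything older than a grace period, so the on-demand fallback can take over. A sketch of that idea, using boto3-style calls purely for illustration (the grace period and the cancel-then-fall-back policy are assumptions, not the planned rewrite):

import datetime
import boto3

GRACE = datetime.timedelta(minutes=10)  # made-up threshold

def cancel_stale_spot_requests(region):
    """Cancel spot requests still unfulfilled after GRACE; return count."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_instance_requests(
        Filters=[{"Name": "state", "Values": ["open"]}])
    now = datetime.datetime.now(datetime.timezone.utc)
    stale = [r["SpotInstanceRequestId"]
             for r in resp["SpotInstanceRequests"]
             if now - r["CreateTime"] > GRACE]
    if stale:
        # Cancelled demand shows up as pending again on the next poll,
        # where it can be covered by on-demand instances instead.
        ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=stale)
    return len(stale)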
Assignee: nobody → rail
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Summary: took longer than expected to start instances for bld-linux64 jobs → rewrite watch pending to cope better with spot requests that aren't being fulfilled
Assignee
Comment 5•11 years ago
I'm going to integrate a better spot bidding algorithm this Friday, which should help with this situation.
The bidding library is here: https://github.com/tarasglek/spotbidagent/blob/master/bid.py
and the related patch for our tools is here: https://github.com/rail/build-cloud-tools/compare/spot_bid?expand=1
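The general shape of a bidding helper like this is to look at recent spot price history per availability zone and only bid in zones where the observed price stays under a cap. An illustrative sketch along those lines (the cap, lookback window, and zone-picking policy are assumptions, not necessarily what bid.py actually does):

import datetime
import boto3

def choose_zone(region, instance_type, bid_cap, lookback_hours=3):
    """Return the cheapest availability zone whose recent spot price
    stayed under bid_cap, or None if no zone qualifies."""
    ec2 = boto3.client("ec2", region_name=region)
    start = (datetime.datetime.now(datetime.timezone.utc)
             - datetime.timedelta(hours=lookback_hours))
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start)["SpotPriceHistory"]
    worst = {}  # zone -> highest price seen in the window
    for point in history:
        price = float(point["SpotPrice"])
        zone = point["AvailabilityZone"]
        worst[zone] = max(worst.get(zone, 0.0), price)
    candidates = [(price, zone) for zone, price in worst.items()
                  if price < bid_cap]
    return min(candidates)[1] if candidates else None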
Assignee
Comment 6•11 years ago
This is in production now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Updated•11 years ago
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Updated•7 years ago
Component: General Automation → General