Closed Bug 1545820 Opened 6 years ago Closed 6 years ago

High Pending Tasks on gecko-t-linux-large pool

Categories

(Infrastructure & Operations Graveyard :: CIDuty, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Unassigned)

Details

Approximately an hour ago, the pending tasks on the gecko-t-linux-large pool started to rise, from 4,600 to 10,470.

That alone would not be an issue, but the running capacity is stuck at 318 instances.

I checked whether any new modifications had landed, but did not find anything obvious at first glance.

Trees have already been closed (including try).
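
For context on the numbers above, the pending backlog for a pool could be read directly from the Taskcluster queue. The sketch below is a minimal example, assuming the 2019-era queue.taskcluster.net hostname and its pending/<provisionerId>/<workerType> endpoint (both have changed since); it only shows how the count was being watched and is not part of any tooling attached to this bug.

    # Minimal sketch: read the pending-task count for a worker pool from the
    # Taskcluster queue (2019-era endpoint; hostname and paths may differ today).
    import json
    import urllib.request

    URL = ("https://queue.taskcluster.net/v1/pending/"
           "aws-provisioner-v1/gecko-t-linux-large")

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # e.g. 10470 around the time this bug was filed
    print(data.get("pendingTasks"))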

IRC log:

<•bcrisan|ciduty> Does someone know something about gecko-t-linux-large workers?
21:16 The pending just hit the 10K mark and the running capacity is very low (about 300-320)
21:23 → Usul joined (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP)
21:25 
<•bcrisan|ciduty> dustin: any known issues with gecko-t-linux-large pool?
21:26 ↔ handyman nipped out  
21:34 
<•bcrisan|ciduty> created bug 1545820 for the above mentioned issue 
21:34 
<firebot> https://bugzil.la/1545820 — NEW, nobody@mozilla.org — High Pending Tasks on gecko-t-linux-large pool
21:34 ⇐ Usul quit (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP) Ping timeout: 121 seconds
21:38 
<•bcrisan|ciduty> CristianB|sheriffduty: dluca|sheriffduty ccoroiu|sheriffduty  trees should be closed for this  ^^
21:38 
<dluca|sheriffduty> bcrisan|ciduty: Ok, closing trees
21:41 bcrisan|ciduty set the topic: OnDuty: -- TREES ARE CLOSED: BUG 1545820 -- bcrisan --Unified #releng / #buildduty channels. CI issues? You’ve come to the right place. | This channel is logged at https://mozilla.logbot.info/ci
21:47 → Usul joined (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP)
21:48 
<•bcrisan|ciduty> !t-rex please look into bug 1545820 
21:48 
<firebot> https://bugzil.la/1545820 — NEW, nobody@mozilla.org — High Pending Tasks on gecko-t-linux-large pool
21:51 
<bstack> I’ll look in one sec!
21:55 ⇐ lizzard quit (ehenry@moz-csi.7im.245.63.IP) Client exited
21:56 
— bstack looks now
21:58 
<dluca|sheriffduty> Looking at try i can see pending on jobs there too. So i'm going to close try as well
21:58 
— •bcrisan|ciduty thanks
21:59 
<bstack> ah, bunches of errors in the ec2-manager logs
22:00 lastModified for that workertype is 2019-04-19T16:20:53.278Z
22:01 
<•bcrisan|ciduty> what timezone is that?
22:02 I can see a pending trend to go up at 10:30 AM GMT 
22:02 
<bstack> that Z at the end makes it UTC
22:03 Kwan|away → Kwan|dinner
22:03 
<bstack> one sec, checking in aws to see what the error is

IRC log (part2):

<bstack> huh, these errors aren't recent though
22:08 I'm going to terminate all of the workers in that pool
22:08 and restart provisioner/manager
22:08 
<dluca|sheriffduty> bstack:  Take a look at this bug first 
22:08 https://bugzilla.mozilla.org/show_bug.cgi?id=1545352
22:08 
<bstack> and then check back in 15 minutes or so. I don't see anything obviously wrong
22:08 
<firebot> Bug 1545352 — ASSIGNED, coop@mozilla.com — Pending builds on Linux, OSX and Android
22:08 
<•bcrisan|ciduty> ok, may this be related to what dluca|sheriffduty shared above?
22:09 
<bstack> yeah, likely related
22:11 oh, I see the failing instances now
22:12 but not much of an idea about why they're failing
22:12 they just report "internal error on startup"
22:13 maybe this should not have had the extra block device added yesterday?
22:13 I wish someone who did worker stuff was around
22:13 have any of these 318 instances launched since the last update to config?
22:13 
<•bcrisan|ciduty> that's a good question
22:14 
<bstack> looks like the answer is no
22:14 oh, here's one that is starting now
22:14 let's see if it survives
22:15 
<dluca|sheriffduty> Fingers crossed
22:15 
— •bcrisan|ciduty plays _eye of the tiger_
22:15 
<dustin> hihi
22:15 ⇐ Usul quit (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP) Ping timeout: 121 seconds
22:15 
<dustin> so I modified that workerType to revert the maxCapacity after the bump Aryx requested yesterday.  I didn't change any other parameters.
22:16 
<bstack> ok, the new one instantly terminated
22:17 dustin: I think w.costa added some configuration around block devices yesterday evening
22:17 
<dustin> true
22:17 
<bstack> and misconfigured block devices is one of the things that can cause aws to just yell "ERROR" at you
22:17 with no logs
22:17 so I think we need to figure out what was changed and change it back
22:18 although that change was made since docker worker was failing to start up
22:18 
<dustin> hmm
22:19 so yesterday we had a new AMI deployed that is, in theory, just an update of the last-good version with the new *.taskcluster-worker.net cert
22:19 
<bstack> yeah,
22:19 but I think that theory must be wrong somehow
22:19 
<dustin> and a few hours later we have a config change to add block-device stuff
22:20 could we revert to the setting we had before that?  I wish we had pete's worker-type history :/
22:20 
<bstack> we can revert if we know what exactly changed
22:21 
<dustin> yeah :(
22:21 so the one that terminated, was there anything in its system.log?
22:21 
<bstack> no, it doesn't even get to the point where it would have a system log
22:21 aws fails to provision a system
22:22 there is conversation around 18:10 yesterday about this in this channel
22:22 between w.costa and c.oop
22:23 
<dustin> ah, but we have a bunch of running instances
22:23 can we steal the ami ID from there?
22:23 hm, i-014b4d4067b3df4f5 for example only has one block device (the root)
22:23 started 24h ago
22:24 and it's running XENqiHH9RsGiiRnscLQhBQ
22:25 
<•bcrisan|ciduty> how many jobs are those instances supposed to take?
22:25 https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-large/workers/us-east-1/i-0190d850cbbca3fe7
22:25 
<dustin> one task at a time
22:26 
<•bcrisan|ciduty> it was running 8 tasks, (test-linux32-shippable/opt-xpcshell-e10s-1 ) with different task id's 
22:26 only one was marked as completed, the rest of them are exceptions
22:27 
<dustin> normal exceptions or worker-related?
22:28 hm, claim-expired
22:30 well, I think that's unrelated to instances not starting up
22:30 but still weird
22:30 so, the first thing I can do is remove the BlockDeviceMapping from the config.. I'll try that now
22:32 ..and restart the provisioner
22:32 let's give that a few minutes to see if it works
22:33 if not, I can try reverting to the AMI ID I see on running instances
22:34 ← Alex_Gaynor left (sid1246@moz-7l35ei.irccloud.com): ""
22:45 
<dustin> I think it's helping..
22:47 
<bstack> dustin: am back
22:47 anything I can do now?
22:47 
<dustin> ok
22:48 I *think* reverting the workerType config fixed it
22:48 
<bstack> ok cool
22:48 
<dustin> I've seen about 400 instance provisioned since you left
22:48 and I'm also looking at the timeline
22:48 
<bstack> not sure what happened yesterday then
22:49 
<dustin> so yesterday in us-west-1 there were working instances with the new AMIs until 22:47 UTC at which time everything stopped provisioning -- does that line up with adding the BlockDeviceMappings bit?
22:49 and the older AMI works fine there
22:49 and after 19:32 UTC today, I see instances running
22:49 which is when I removed the BlockDeviceMapping clause
22:50 
<•bcrisan|ciduty> We'll keep the trees closed until half of the current pendings are done
22:51 
<dustin> same pattern in us-east-1
22:51 gecko-t-linux-xlarge doesn't have the BlockDeviceMapping
22:53 bstack: I bet what happened is that wcosta added that BDM to the gecko-3-b-linux being discussed, which *fixed* that workertype
22:53 but also added it to gecko-t-linux-large, which *broke* that workerType
22:53 
<bstack> ahhh
22:53 ok, everything makes sense now
22:54 
<dustin> bcrisan|ciduty: so I think that's a good plan

TL;DR:

  • Likely related to bug 1545352
  • Misconfigured block devices are one of the things that can cause AWS to fail with a bare "ERROR" and no further detail
  • Instances that fail this way never get far enough to produce a system log, so there is nothing to inspect on the worker side (see the CLI sketch after this list)
  • Fix: remove the BlockDeviceMapping from the workerType config and restart the provisioner
  • Reverting the workerType config fixed it; new instances started spinning up
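
Because the failing instances never produced a system log, the only breadcrumb is the EC2-side state reason. The following is a hedged sketch of how that can be pulled with boto3; the tag filter is an assumption about how the provisioner labelled its instances and may need adjusting.

    # Sketch: list terminated instances and their EC2 StateReason, which is
    # where "internal error on startup"-style messages appear when an instance
    # never boots far enough to write a system log.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["terminated"]},
            # Assumption: workers carry their workerType in the Name tag;
            # change this filter if the provisioner tags instances differently.
            {"Name": "tag:Name", "Values": ["gecko-t-linux-large"]},
        ]
    )

    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            reason = inst.get("StateReason", {})
            print(inst["InstanceId"], reason.get("Code"), reason.get("Message"))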

Cause:

  • The BlockDeviceMapping that was added to gecko-3-b-linux (which *fixed* that workerType) was also added to gecko-t-linux-large, which *broke* that workerType (see the config sketch below)
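
For reference, a BlockDeviceMapping clause of the kind discussed above follows the EC2 RunInstances schema. The fragment below is only a sketch written as a Python literal; the device name and volume size are illustrative, not the values that were actually added, since the real workerType JSON is not quoted in this bug.

    # Sketch of the kind of block-device clause that, when added to a
    # workerType's launchSpec, can make EC2 reject the launch outright
    # (for example if the device name collides with the AMI's root device).
    # Values are illustrative only.
    block_device_mappings = [
        {
            "DeviceName": "/dev/xvdb",      # hypothetical device name
            "Ebs": {
                "VolumeSize": 120,          # GiB, illustrative
                "VolumeType": "gp2",
                "DeleteOnTermination": True,
            },
        },
    ]

    # The fix in this bug was simply to delete the clause from the
    # gecko-t-linux-large definition and restart the provisioner, falling
    # back to the AMI's default block-device layout.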

Trees will remain closed until a big chunk of the pending tasks has cleared (about half of them, roughly 6K).

Trees have been reopened and everything looks good!

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Re-opened the bug as the issue re-appeared today: there is currently a backlog of 6,824 pending tasks against a running capacity of 613.

<RyanVM> yikes: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-xlarge
21:44:50 6850 pending
21:49:19
<&apop|ciduty> aki, I'm currently working on the buildpuppet PR. I've found some changes on signing_scriptworker. Currently, I'm fixing a conflict. I'll let you know when I'm done
21:49:40
<aki> ok
21:50:36 ⇐ mayhemer quit (Miranda@moz-9mslpd.broadband6.iol.cz) Quit: Miranda NG! Smaller, Faster, Easier. http://miranda-ng.org/
21:51:51
<tomprince> aki: I don't think there is anything blocking.
21:52:11
<aki> i have a minimized patch that just went green. i can send to phab
21:52:11
<RyanVM> apop|ciduty: any idea about that backlog? ^
21:52:49
<tomprince> aki: Probably a stub makes sense, since it only supports hooks at the moment, and ci-admin doesn't have support for creating them yet.
21:54:56
<&apop|ciduty> I'm checking
21:57:56 → gbrown joined (gbrown@moz-e47qhv.cg.shawcable.net)
21:58:19
<&apop|ciduty> bstack: can you please look into this?
21:58:20 ⇐ gbrown quit (gbrown@moz-e47qhv.cg.shawcable.net) Connection closed
21:58:32
— bstack looks
21:58:41 → gbrown joined (gbrown@moz-e47qhv.cg.shawcable.net)
22:01:17
<bstack> restarted some provisioner stuff. will watch to see if that fixes things
22:01:30 the good news is that worker-manager landed in master of taskcluster today
22:01:42 so hopefully aws-provisioner is relatively short for this world
22:01:50 probably just a couple more months

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

dustin
that's probably because we stopped it from using c3.xlarge
because there's a test that runs out of memory
walac [1:13 PM]
hrm, I remember last time we removed an instance from worker type, aws-provisioner started to fail to launch instances
dustin [1:15 PM]
yeah, I think it was the same workerType
so I suspect the fix here is to add c3.2xlarge to the workerType
@coop ^^ that's double the cost.. is that OK?
it's a bit faster but probably most of the overhead is i/o so certainly not 2x as fast
coop [1:16 PM]
i think that's fine
dustin [1:17 PM]
ok
does so now

This was caused by removing the c3.xlarge instance type in bug 1551525; those instances don't have enough RAM (7.5 GB) to run some devtools tests (I thought 640K was enough for anyone...). I assume we just can't get enough m3.xlarges to meet the demand.

I've added c3.2xlarge (15 GB), and hopefully that clears the backlog, as I commented in that bug.
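
The shape of the change, roughly: the workerType's instanceTypes list gained a c3.2xlarge entry alongside m3.xlarge. The field names below are assumed from the aws-provisioner-v1 workerType schema and the capacity/utility values are illustrative; this is a sketch, not a quote of the actual definition.

    # Sketch of the instanceTypes portion of gecko-t-linux-large after the
    # change above. Field names assumed from the aws-provisioner-v1 schema;
    # capacity/utility values are illustrative.
    instance_types = [
        {
            "instanceType": "m3.xlarge",   # 15 GB RAM, already in the pool
            "capacity": 1,                 # one task at a time, as noted above
            "utility": 1.0,
        },
        {
            "instanceType": "c3.2xlarge",  # 15 GB RAM, added to absorb the backlog
            "capacity": 1,
            "utility": 1.0,                # about double the cost of c3.xlarge
        },
    ]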

The c3.xlarge instance type was missing as well; I can now see some instances pending.

Yeah, there are a ton of instances now. The backlog should be history before we know it. Thanks wcosta.

Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

OK, so I'll re-open bug 1551525 since c3.xlarge instances fail some tests.

The root cause of this bug was the same as in bug 1545352 comment 2.

Oh, I'm a fool. Sorry!

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard