High Pending Tasks on gecko-t-linux-large pool
Categories: Infrastructure & Operations Graveyard :: CIDuty, defect, P1
Tracking: Not tracked
People: Reporter: bcrisan; Assignee: unassigned
Details
Approximately an hour ago, pending tasks on the gecko-t-linux-large pool started to rise, from 4600 to 10470 tasks.
That by itself should not be an issue, but the running capacity is stuck at 318 instances.
I checked whether any new modifications had landed, but did not find anything obvious at first glance.
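For context, below is a minimal sketch of how the pending backlog for a pool can be polled; it assumes the 2019-era queue.taskcluster.net deployment and its pendingTasks endpoint, and is illustrative rather than the tooling actually used here.

# Illustrative sketch (not the tooling used in this bug): poll the Taskcluster
# queue's pendingTasks endpoint for a worker type. The hostname reflects the
# pre-rootURL deployment that was in use at the time.
import requests

QUEUE = "https://queue.taskcluster.net/v1"

def pending_count(provisioner="aws-provisioner-v1", worker_type="gecko-t-linux-large"):
    resp = requests.get(f"{QUEUE}/pending/{provisioner}/{worker_type}", timeout=30)
    resp.raise_for_status()
    return resp.json()["pendingTasks"]

if __name__ == "__main__":
    # At the time this bug was filed, this count was roughly 10470 against
    # ~318 running instances.
    print(pending_count())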
Comment 1 • 6 years ago (Reporter)
Trees have already been closed (including try).
Comment 2 • 6 years ago (Reporter)
IRC log:

<•bcrisan|ciduty> Does someone know something about gecko-t-linux-large workers?
21:16 <•bcrisan|ciduty> The pending just hit the 10K mark and the running capacity is very low (about 300-320)
21:25 <•bcrisan|ciduty> dustin: any known issues with gecko-t-linux-large pool?
21:34 <•bcrisan|ciduty> created bug 1545820 for the above-mentioned issue
21:34 <firebot> https://bugzil.la/1545820 — NEW, nobody@mozilla.org — High Pending Tasks on gecko-t-linux-large pool
21:38 <•bcrisan|ciduty> CristianB|sheriffduty: dluca|sheriffduty ccoroiu|sheriffduty trees should be closed for this ^^
21:38 <dluca|sheriffduty> bcrisan|ciduty: Ok, closing trees
21:41 bcrisan|ciduty set the topic: OnDuty: -- TREES ARE CLOSED: BUG 1545820 -- bcrisan -- Unified #releng / #buildduty channels. CI issues? You’ve come to the right place. | This channel is logged at https://mozilla.logbot.info/ci
21:48 <•bcrisan|ciduty> !t-rex please look into bug 1545820
21:48 <firebot> https://bugzil.la/1545820 — NEW, nobody@mozilla.org — High Pending Tasks on gecko-t-linux-large pool
21:51 <bstack> I’ll look in one sec!
21:56 * bstack looks now
21:58 <dluca|sheriffduty> Looking at try I can see pending jobs there too, so I'm going to close try as well
21:58 * bcrisan|ciduty thanks
21:59 <bstack> ah, bunches of errors in the ec2-manager logs
22:00 <bstack> lastModified for that workertype is 2019-04-19T16:20:53.278Z
22:01 <•bcrisan|ciduty> what timezone is that?
22:02 <•bcrisan|ciduty> I can see the pending trend going up at 10:30 AM GMT
22:02 <bstack> that Z at the end makes it UTC
22:03 <bstack> one sec, checking in aws to see what the error is
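As a side note on the timezone question above: the trailing Z in lastModified marks the value as UTC. A small illustrative parse follows; the UTC+3 offset is only an example of a local zone, not an assertion about anyone's location.

# Parse the lastModified value quoted above; the trailing "Z" means UTC.
from datetime import datetime, timezone, timedelta

last_modified = datetime.strptime(
    "2019-04-19T16:20:53.278Z", "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)

print(last_modified)  # 2019-04-19 16:20:53.278000+00:00
print(last_modified.astimezone(timezone(timedelta(hours=3))))  # same instant, shown in UTC+3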
Comment 3 • 6 years ago (Reporter)
IRC log (part 2):

<bstack> huh, these errors aren't recent though
22:08 <bstack> I'm going to terminate all of the workers in that pool
22:08 <bstack> and restart provisioner/manager
22:08 <dluca|sheriffduty> bstack: Take a look at this bug first
22:08 <dluca|sheriffduty> https://bugzilla.mozilla.org/show_bug.cgi?id=1545352
22:08 <bstack> and then check back in 15 minutes or so. I don't see anything obviously wrong
22:08 <firebot> Bug 1545352 — ASSIGNED, coop@mozilla.com — Pending builds on Linux, OSX and Android
22:08 <•bcrisan|ciduty> ok, may this be related to what dluca|sheriffduty shared above?
22:09 <bstack> yeah, likely related
22:11 <bstack> oh, I see the failing instances now
22:12 <bstack> but not much of an idea about why they're failing
22:12 <bstack> they just report "internal error on startup"
22:13 <bstack> maybe this should not have had the extra block device added yesterday?
22:13 <bstack> I wish someone who did worker stuff was around
22:13 <bstack> have any of these 318 instances launched since the last update to the config?
22:13 <•bcrisan|ciduty> that's a good question
22:14 <bstack> looks like the answer is no
22:14 <bstack> oh, here's one that is starting now
22:14 <bstack> let's see if it survives
22:15 <dluca|sheriffduty> Fingers crossed
22:15 * •bcrisan|ciduty plays _eye of the tiger_
22:15 <dustin> hihi
22:15 <dustin> so I modified that workerType to revert the maxCapacity after the bump Aryx requested yesterday. I didn't change any other parameters.
22:16 <bstack> ok, the new one instantly terminated
22:17 <bstack> dustin: I think w.costa added some configuration around block devices yesterday evening
22:17 <dustin> true
22:17 <bstack> and misconfigured block devices is one of the things that can cause aws to just yell "ERROR" at you
22:17 <bstack> with no logs
22:17 <bstack> so I think we need to figure out what was changed and change it back
22:18 <bstack> although that change was made because docker-worker was failing to start up
22:18 <dustin> hmm
22:19 <dustin> so yesterday we had a new AMI deployed that is, in theory, just an update of the last-good version with the new *.taskcluster-worker.net cert
22:19 <bstack> yeah,
22:19 <bstack> but I think that theory must be wrong somehow
22:19 <dustin> and a few hours later we have a config change to add block-device stuff
22:20 <dustin> could we revert to the setting we had before that? I wish we had pete's worker-type history :/
22:20 <bstack> we can revert if we know what exactly changed
22:21 <dustin> yeah :(
22:21 <dustin> so the one that terminated, was there anything in its system.log?
22:21 <bstack> no, it doesn't even get to the point where it would have a system log
22:21 <bstack> aws fails to provision a system
22:22 <bstack> there is conversation around 18:10 yesterday about this in this channel
22:22 <bstack> between w.costa and c.oop
22:23 <dustin> ah, but we have a bunch of running instances
22:23 <dustin> can we steal the AMI ID from there?
22:23 <dustin> hm, i-014b4d4067b3df4f5 for example only has one block device (the root)
22:23 <dustin> started 24h ago
22:24 <dustin> and it's running XENqiHH9RsGiiRnscLQhBQ
22:25 <•bcrisan|ciduty> how many jobs are those instances supposed to take?
22:25 <•bcrisan|ciduty> https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-large/workers/us-east-1/i-0190d850cbbca3fe7
22:25 <dustin> one task at a time
22:26 <•bcrisan|ciduty> it was running 8 tasks (test-linux32-shippable/opt-xpcshell-e10s-1) with different task IDs
22:26 <•bcrisan|ciduty> only one was marked as completed, the rest of them are exceptions
22:27 <dustin> normal exceptions or worker-related?
22:28 <dustin> hm, claim-expired
22:30 <dustin> well, I think that's unrelated to instances not starting up
22:30 <dustin> but still weird
22:30 <dustin> so, the first thing I can do is remove the BlockDeviceMapping from the config.. I'll try that now
22:32 <dustin> ..and restart the provisioner
22:32 <dustin> let's give that a few minutes to see if it works
22:33 <dustin> if not, I can try reverting to the AMI ID I see on running instances
22:45 <dustin> I think it's helping..
22:47 <bstack> dustin: am back
22:47 <bstack> anything I can do now?
22:47 <dustin> ok
22:48 <dustin> I *think* reverting the workerType config fixed it
22:48 <bstack> ok cool
22:48 <dustin> I've seen about 400 instances provisioned since you left
22:48 <dustin> and I'm also looking at the timeline
22:48 <bstack> not sure what happened yesterday then
22:49 <dustin> so yesterday in us-west-1 there were working instances with the new AMIs until 22:47 UTC, at which time everything stopped provisioning -- does that line up with adding the BlockDeviceMappings bit?
22:49 <dustin> and the older AMI works fine there
22:49 <dustin> and after 19:32 UTC today, I see instances running
22:49 <dustin> which is when I removed the BlockDeviceMapping clause
22:50 <•bcrisan|ciduty> We'll keep the trees closed until half of the current pendings are done
22:51 <dustin> same pattern in us-east-1
22:51 <dustin> gecko-t-linux-xlarge doesn't have the BlockDeviceMapping
22:53 <dustin> bstack: I bet what happened is that wcosta added that BDM to the gecko-3-b-linux being discussed, which *fixed* that workertype
22:53 <dustin> but also added it to gecko-t-linux-large, which *broke* that workerType
22:53 <bstack> ahhh
22:53 <bstack> ok, everything makes sense now
22:54 <dustin> bcrisan|ciduty: so I think that's a good plan
TL;DR:
- Likely related to bug 1545352.
- Misconfigured block devices are one of the things that can cause AWS to just yell "ERROR".
- A failing instance never gets to the point where it would have a system log, so there are no logs to inspect.
- The fix was to remove the BlockDeviceMapping from the workerType config and restart the provisioner.
- Reverting the workerType config fixed it: new instances started spinning up.
Cause:
- When the BDM was added to gecko-3-b-linux, which fixed that workerType, it was also added to gecko-t-linux-large, which broke that workerType.
Trees will remain closed until a big chunk of the pending tasks has cleared (about half of them, roughly 6K).
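For readers unfamiliar with the term, a rough sketch of what a BlockDeviceMappings clause looks like in an EC2 launch specification follows (boto3/RunInstances-style keys); the device name and volume values are made up for illustration, not the actual clause that was added to gecko-t-linux-large.

# Illustrative only: the general shape of an EC2 block device mapping in a
# launch spec. The concrete values added to the workerType are not recorded
# in this bug.
block_device_mappings = [
    {
        "DeviceName": "/dev/xvdb",  # must be compatible with the AMI / instance type
        "Ebs": {
            "VolumeSize": 120,           # GiB
            "VolumeType": "gp2",
            "DeleteOnTermination": True,
        },
    },
]

# A mapping that EC2 rejects for the chosen AMI or instance type fails before
# any OS boots, which is why the instances above never produced a system.log.
# The fix was deleting the clause from the workerType config and restarting
# the provisioner.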
Comment 4 • 6 years ago (Reporter)
Trees have been reopened and everything looks good!
Comment 5 • 6 years ago
Re-opened the bug, as the issue re-appeared today: there is currently a backlog of 6824 pending tasks with a running capacity of 613.
Comment 6 • 6 years ago
<RyanVM> yikes: https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-xlarge
21:44:50 <RyanVM> 6850 pending
21:49:19 <&apop|ciduty> aki, I'm currently working on the buildpuppet PR. I've found some changes on signing_scriptworker. Currently, I'm fixing a conflict. I'll let you know when I'm done
21:49:40 <aki> ok
21:51:51 <tomprince> aki: I don't think there is anything blocking.
21:52:11 <aki> i have a minimized patch that just went green. i can send to phab
21:52:11 <RyanVM> apop|ciduty: any idea about that backlog? ^
21:52:49 <tomprince> aki: Probably a stub makes sense, since it only supports hooks at the moment, and ci-admin doesn't have support for creating them yet.
21:54:56 <&apop|ciduty> I'm checking
21:58:19 <&apop|ciduty> bstack: can you please look at this?
21:58:32 * bstack looks
22:01:17 <bstack> restarted some provisioner stuff. will watch to see if that fixes things
22:01:30 <bstack> the good news is that worker-manager landed in master of taskcluster today
22:01:42 <bstack> so hopefully aws-provisioner is relatively short for this world
22:01:50 <bstack> probably just a couple more months
Comment 7 • 6 years ago
dustin
that's probably because we stopped it from using c3.xlarge
because there's a test that runs out of memory
walac [1:13 PM]
hrm, I remember last time we removed an instance type from a worker type, aws-provisioner started to fail to launch instances
dustin [1:15 PM]
yeah, I think it was the same workerType
so I suspect the fix here is to add c3.2xlarge to the workerType
@coop ^^ that's double the cost.. is that OK?
it's a bit faster, but probably most of the overhead is I/O, so certainly not 2x as fast
coop [1:16 PM]
i think that's fine
dustin [1:17 PM]
ok
does so now
Comment 8 • 6 years ago
This was caused by removing c3.xlarge instances in bug 1551525, which don't have enough RAM (7.5G) to run some devtools tests (I thought 640k was enough for anyone...). I assume we just can't get enough m3.xlarge instances to meet the demand.
I've added c3.2xlarge (15GB), and hopefully that clears the backlog, as I commented in that bug.
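The RAM figures can be sanity-checked against EC2 itself; below is a small sketch, assuming boto3 with configured AWS credentials and a region that still lists these older instance types.

# Sketch: confirm the memory on the instance types discussed above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(
    InstanceTypes=["c3.xlarge", "m3.xlarge", "c3.2xlarge"]
)
for it in resp["InstanceTypes"]:
    gib = it["MemoryInfo"]["SizeInMiB"] / 1024
    print(f'{it["InstanceType"]}: {gib:.1f} GiB RAM')
# Expected ballpark: c3.xlarge ~7.5 GiB, m3.xlarge and c3.2xlarge ~15 GiB.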
Comment 9 • 6 years ago
The c3.xlarge instance type was missing as well. I can now see some pending instances.
Comment 10 • 6 years ago
Yeah, there are a ton of instances now. The backlog should be history before we know it. Thanks wcosta.
Comment 11 • 6 years ago
OK, so I'll re-open bug 1551525 since c3.xlarge instances fail some tests.
Comment 12 • 6 years ago
The root cause of this bug is the same as in bug 1545352 comment 2.
Comment 13 • 6 years ago
Oh, I'm a fool. Sorry!