Closed Bug 1431742 Opened 7 years ago Closed 6 years ago

Huge numbers of spot requests and few running workers, especially on windows

Categories

(Taskcluster :: Services, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: grenade)

References

Details

This is an issue with EC2 with never fulfilling our spot requests.

Trees are closed for this.

https://tools.taskcluster.net/aws-provisioner/

There are only 37 gecko-t-win7-32-gpu workers running, it's a full virtual capacity because the remaining capacity are spot requests. There are 3500 pending jobs.

For gecko-1-b-win2012, there are 264 running jobs and 69 pending, this also looks suspicious.

This affects both both win7 and win10. "gecko-t-linux-large" has also 500 blue ones, but no backlog.
Component: Queue → AWS-Provisioner
To be clear, this is an issue on the EC2 side, not in our systems.  We're in contact with EC2 about why these requests are taking so long to resolve.  I can comment when we hear back about a reason and if there are any steps to take.
As a diagnostic, I've cancelled a lot of instances which were stuck in a pending-evaluation state for a long time.  I have left some around for debugging and also to try to avoid flooding the API with new requests.
I've run the following query:

delete from spotrequests where created < now() - interval '1 hour';

on the production database as well as manually cancelling some of the oldest requests.

This query will make the ec2-manager ignore these pending instances for about an hour, when they'll be re-added by the housekeeper.
EC2 has replied and has taken action to correct this issue.
We have confirmation that this issue has been resolved for us-west-2 and us-east-1 where it was presenting previously.  us-west-1 has not yet been repaired and so I will disable it for production.
us-west-1 was disabled in the provisioner.  I've also deleted all spot request records from us-west-1 from ec2-manager and will cancel all pending requests in that region
We are seeing new spot requests being fulfilled with success.
As part of an AMI cleanup, it seems that some AMIs still configured in production workerTypes were deleted.  This prevented new instances from starting up after the fixes a few hours ago.  Affected workerTypes:

2018-01-19T19:34:40.366651+00:00 app[web.1]: 19:34:40.366Z  WARN aws-provisioner-production: found worker types which cannot launch, ignoring them
2018-01-19T19:34:40.366653+00:00 app[web.1]:     disabled: [
2018-01-19T19:34:40.366655+00:00 app[web.1]:       "b2gtest",
2018-01-19T19:34:40.366656+00:00 app[web.1]:       "deepspeech-worker",
2018-01-19T19:34:40.366657+00:00 app[web.1]:       "gecko-1-b-android",
2018-01-19T19:34:40.366658+00:00 app[web.1]:       "gecko-1-b-linux",
2018-01-19T19:34:40.366659+00:00 app[web.1]:       "gecko-1-images",
2018-01-19T19:34:40.366660+00:00 app[web.1]:       "gecko-3-b-android",
2018-01-19T19:34:40.366661+00:00 app[web.1]:       "gecko-3-b-macosx64",
2018-01-19T19:34:40.366662+00:00 app[web.1]:       "gecko-t-linux-large",
2018-01-19T19:34:40.366663+00:00 app[web.1]:       "gecko-t-linux-xlarge",
2018-01-19T19:34:40.366664+00:00 app[web.1]:       "github-worker",
2018-01-19T19:34:40.366666+00:00 app[web.1]:       "releng-svc",
2018-01-19T19:34:40.366667+00:00 app[web.1]:       "taskcluster-generic",
2018-01-19T19:34:40.366668+00:00 app[web.1]:       "taskcluster-images",
2018-01-19T19:34:40.366669+00:00 app[web.1]:       "tutorial"
2018-01-19T19:34:40.366670+00:00 app[web.1]:     ]
The next issue was that Windows instances were busted.  :gps made https://github.com/mozilla-releng/OpenCloudConfig/pull/116 to try to roll some of that back, and also edited the workerType definitions to refer to old, working AMIs.  Tasks seem to be running on Windows now.
Only gecko-t-win7-32-gpu is known to be fixed. And it may have regressed due to the deploy of the AMI with the reverts. The old AMI before a few hours ago was from late October.

It's quite possible that the roll out of busted Windows AMIs (that AFAICT "hung" on startup and didn't accept any work) caused a bunch of cascading issues in the provisioner.
Things have finally restored to somewhat normalcy. Backlogs are ~0. Sheriffs have started gently opening trees. Tasks on new pushes are running. Things aren't obviously blowing up.

As Dustin pointed out, a series of OpenCloudConfig commits (e741979420, a3453c1e0) this morning triggered building and deployment of new AMIs for the gecko-t-win7-32, gecko-t-win7-32-gpu, gecko-t-win10-64, gecko-t-win10-64-gpu, and gecko-t-win10-64-gpu-b worker types.

I'm not sure what, but *something* in these commits was bad. The generated AMIs didn't accept any work. The screenshot tool in the AWS web console showed the instances at the "booting" screen. It certainly looked like the instances were hanging on startup.

jhford noticed that these Windows instances were terminating after 5-6 minutes of uptime. I'm not sure what mechanism was causing them to terminate, but it certainly looks like *something* (either us or Amazon) detected the instances never booted up and terminated them.

This appears (although jhford is the expert here) to have led to an infinite cycle of sorts where the provisioner keep respawing faulting instances. Normally instances are alive for at least an hour. But in this case, instances would only be around for 5-6 minutes. So, this potentially increases the load on the provisioner for these worker types by up to 10x.

My theory is that this continuous respawning of bad instances contributed to a significant or even complete part of the provisioner's troubles today. Supporting my claim are a few observations.

First, at the end of the day, jhford thought the provisioner was deader than dead. The proposed mitigation was a bit risky. So we decided to give older, known-working AMIs a try. It was a slow process, but as we replaced the bad AMIs with good ones, the provisioner starting behaving somewhat rationally again. Things were still spotty. But by the end of the day after all the bad AMIs were purged and we finally had enough workers spawned to start churning through the backlog, the provisioner was almost back to normal. Correlation does not imply causation. But it smells like there was a relationship between the bad AMIs and instances and provisioner's ailments.

Second (and probably more convincing), the AMI updates from the OCC commits happened *before* things started blowing up. https://tools.taskcluster.net/groups/LDn6pjxUTCK0QmwAQDWC8w is the first set of deploys. https://tools.taskcluster.net/groups/UreyjaD9TRiPAOziauszqQ is the 2nd set of deploys.

Relevant times (all UTC) for AMI updates on those bad deploys:

e74197942
12:00 gecko-t-win7-32 https://tools.taskcluster.net/groups/LDn6pjxUTCK0QmwAQDWC8w/tasks/MjKwUW-rRBmKoAQblnmRTw/runs/0
11:45 gecko-t-win7-32-gpu https://tools.taskcluster.net/groups/LDn6pjxUTCK0QmwAQDWC8w/tasks/NVV6N1ytQD-f5FLv9mUuRQ/runs/0
11:54 gecko-t-win10-64 https://tools.taskcluster.net/groups/LDn6pjxUTCK0QmwAQDWC8w/tasks/VGQJaIzRSsmBGXBWfICfNw/runs/0
12:01 gecko-t-win10-64-gpu https://tools.taskcluster.net/groups/LDn6pjxUTCK0QmwAQDWC8w/tasks/S10xp1-iRSOjnbk7z-OKmA/runs/0

a3453c1e0
12:49 gecko-t-win10-64-gpu https://tools.taskcluster.net/groups/UreyjaD9TRiPAOziauszqQ/tasks/NGc-Wa7ERaCAJC-AfKltlg/runs/0
13:07 gecko-t-win10-64-gpu-b https://tools.taskcluster.net/groups/UreyjaD9TRiPAOziauszqQ/tasks/dBrA4dBHQKCswxIXnL0GkA/runs/0

When the deploys are made, it takes several minutes for new instances to be booted up and to accept new work. I can't recall how long old instances linger around.

Aryx made noise in #taskcluster about backlog at 14:16 UTC. This is quite notably *after* all the Windows AMI deployments were made. Again, it certainly looks like the rapid termination of instances due to bad AMIs could be the root cause here.

A remediation to the problem was reverting to good AMIs for the impacted worker types.

I performed a lot of the actions to restore AMIs. It was a very laborious process.

Dustin and I initially tried to revert the Git commits in OpenCloudConfig and hope a deploy would fix things. However, we realized that the deploy would take ~50 minutes. Furthermore, we realized some of the old AMIs were very old. gecko-t-win7-32-gpu was last on an AMI from October 27! We didn't want to wait ~50 minutes to start recovery. And we realized that there was no guarantee that new AMIs would work since we hadn't done an AMI update in so long. We decided we wanted to roll back to last known good AMIs.

Figuring out the last known good AMIs was difficult. After poking through the deploy logs (which are conveniently Taskcluster tasks and linked next to the GitHub commits that triggered them), I realized the logs record the current AMIs and the new AMIs. So, I had to look through the Git commit log, identify which commits triggered deploys (it is done via the commit message syntax), find those commits in the GitHub web ui, follow the link to the taskgraph, find tasks for the workers being deployed, and then find the log lines indicating the old and new AMIs. I initially did this as a one-off for gecko-t-win7-32-gpu, as it had the largest backlog and therefore I thought it was the most important worker to fix. I updated the AMIs in the AWS Provisioner web UI, manually terminated EC2 instances running the old AMIs, and waited. Within a few minutes AWS provisioner created new instances and Pulse events told me that tasks were running on these new instances. That was a very encouraging sign. I actually don't think it was until this time that I was confident that the AMIs were the culprit. I did have pretty circumstantial evidence because an OCC deploy was made earlier. And the fact workers were terminating fit the bill of a bad AMI. So rolling back to a known working config felt like a prudent course of action.

I eventually established https://public.etherpad-mozilla.org/p/windows-ami-restore to help track the process of rolling back AMIs. Going through multiple TC task logs to find AMIs was laborious. Then I had to manually update worker configs in the AWS web UI. I know there are automated ways to do this. But I'm not sure what scripts exist for that. It also didn't help that the provisioner was periodically dying due to apparent load issues at the time. So I couldn't reliably update configs in the web UI.

As part of manually updating AMIs, I made at least one mistake. I assigned the gecko-t-win10-64-gpu-b AMIs to the gecko-t-win10-64-gpu worker. This resulted in no change in behavior AFAICT: instances would spawn up and get terminated shortly thereafter (I think).

A complication was that the latest known-working AMI for gecko-t-win10-64-gpu no longer existed in us-east-1 and us-west-1. Looking at the AWS web console, I couldn't find an AMI built before today. But I did find an older AMI in us-west-2. So gecko-t-win10-64-gpu is now running in us-west-2 only because that's the only region with known working AMIs.

Similarly, gecko-t-win10-64-gpu-b is no longer running in us-west-1 due to lack of a good AMI candidate.

While things are fresh in my mind from fighting this fire, here are some potential remediation ideas to discuss at the post-mortem:

* Better notification of AMI deploys. There was 0 notification in #taskcluster that a change was being rolled out. I think we want better visibility into when the infrastructure is changing so that when things blow up, we can look at what changed recently to better identify culprits for weirdness.
* Better notification of pending tasks. I'm not sure how long pending task counts were climbing before Aryx made noise in #taskcluster. Quicker identification of this event could lead to more rapid triage and resolution of problems.
* Better notification of failure to run tasks to completion. As soon as the bad AMIs took hold, we weren't completing tasks - nor starting - tasks for those worker types. This should be a smoking gun that something is wrong.
* Better monitoring of minimum instance uptimes. If an instance doesn't live for a minimum period of time or doesn't accept work, that's a signal that something is off. Some kind of monitoring and alerting here feels reasonable.
* Easier rollback of bad AMI updates. This applies to Windows and docker-worker. In the event that an AMI update is bad, we need to be able to revert that change as quickly as possible. This should be automated and multiple TC admins should be able to initiate that process.
* More intelligent purging of old AMIs. The AMI rollback procedure was hindered in some cases because good AMIs were deleted and no longer available! Because we did multiple deploys from OCC today and OCC deployment mechanism appears to purge all but the last N AMIs, good AMIs that we wanted to roll back to were not available. I think we should preserve AMIs for at least N days after they were last used. N should probably be at least 30 days. Probably more like 90 in case we need to test things on older AMIs.
* Mass updating worker AMIs for all OCC changes. We had a worker running an AMI from October 27. Others were running AMIs from a few days ago. There have been numerous changes to OCC since October. Any of those changes could have broken that specific worker type. We should consider striving for uniformity among our AMI configuration so surprises don't occur long after changes are made to OCC.
* Better testing of AMIs before they are put in production and/or adherence to existing testing process. Putting an AMI in production immediately after generation sounds risky. I know our AMI generation is not deterministic. There should be a smoketest of sorts on an AMI before it goes into production and can result in a tree closure.
* Ability to update worker configs when provisioner web UI is offline. I wanted to update AMIs when jhford took the provisioner offline as a means of "hard reset." There really wasn't a way to update the config without the web UI because the web UI apparently relies on the provisioner running. (I'm not clear of technical reality here.)

grenade: you have some work cut out for you on Monday! We'll need working AMIs for gecko-t-win10-64-gpu restored in use1 and usw1 and for the worker config to be updated to re-enable those regions. Ditto for gecko-t-win10-64-gpu-b in usw1. The deleted worker config snippets are at https://public.etherpad-mozilla.org/p/windows-ami-restore. In addition, I'm not sure if the manual muckery of worker configs messed with OCC's deployment mechanism. The cancellation of some deployment tasks in https://tools.taskcluster.net/groups/bHARn70ORxGzfIxMZMit_g may have also left things in a bad state.
Flags: needinfo?(rthijssen)
i'm grateful for this very detailed and insightful analysis of what went wrong. i'd also like to apologise for causing the problems due to commits i made on friday. i will work on testing some patches over the weekend with a view to getting some new amis tested and deployed before monday.

i agree with pretty much all the suggestions in the remediation points above and will work towards implementing these in the coming days.
Flags: needinfo?(rthijssen)
Assignee: nobody → rthijssen
(In reply to Rob Thijssen (:grenade UTC+2) from comment #12)
> i will work on testing some patches over the weekend with a
> view to getting some new amis tested and deployed before monday.

Things seem to be in a relatively happy state right now. I'd encourage you to wait until Monday before working on anything and especially before changing anything in production :)

> i agree with pretty much all the suggestions in the remediation points above
> and will work towards implementing these in the coming days.

Excellent! Just as long as those days don't start before Monday: I don't think there's a pressing need to do any of this over the weekend. Also, keep in mind we have a post-mortem on Monday. It's likely we'll revise and prioritize the remediation items then.
I sort of hope that the fixes here caused bug 1431950, which is currently closing trees, since if they didn't that must mean someone landed some busted infra change while we were closed for this.
Depends on: 1431950
In the chaos of yesterday, new docker-worker AMIs were also rolled out. They shouldn't have been. But hindsight is 20/20. Recent changes likely regressed something.

I see bstack and others are looking into a deploy of docker-worker right now. They shouldn't deploy from master until Monday because we don't want the per-second billing changes to go out without people around to babysit.

They should instead branch off an older docker-worker commit - say c79bc7ca74000b57955e1763f88f60ec4b005ca0. Although some other fixes have gone in since then.

If they want to deploy from master, they should revert the billing cycle commits. That's merge 0f4cbdb59a0b5afebc6ddd2b26acda60b8cf66c3.
the problem appears to have been with unmapped drives on windows 7.

for some time we've had an intermittent problem with a failure to map the task (z:) and/or cache (y:) drives on win 7 only. cause is still undetermined but being intermittent, i previously bandaged it with a check on boot for the presence of the drives. if they're missing, the instance is terminated since it's useless in that state.

on friday i rolled new amis with a patch for an unrelated problem. however the drive mapping issue appeared to go from intermittent to permafail on the new amis. i don't yet know for sure why and can only guess that the problem is a race condition whereby occ checks for the existence of the drives before ec2config has had a chance to map them. my hunch is that for some reason occ is running earlier or faster on the new amis and running the drive check earlier than it did on older amis.

i'm working on some tests that introduce a wait loop in occ to see if waiting some length of time for ec2config to map the drives, results in the drives being detected and mapped correctly. i will update this bug if the patch is successful or proves to be a red herring idea.
introducing a wait loop confirmed the race condition theory. the drive letters are eventually mapped after ec2config gets round to mounting the volumes.
I believe this is not currently and issue so am reducing the priority. Can this be closed, :grenade?
Severity: blocker → major
Flags: needinfo?(rthijssen)
yes
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → FIXED
Component: AWS-Provisioner → Services
You need to log in before you can comment on or make changes to this bug.