Closed Bug 1077869 Opened 8 years ago Closed 8 years ago

docker-worker: Formatting instance storage fails intermittently


(Taskcluster :: Workers, defect)



(Reporter: jlal, Assigned: garndt)




(1 file)

52 bytes, text/x-github-pull-request
At least once now (see node i-75a12c7a in papertrail) I have seen a worker node start without actually booting the docker-worker and pulling tasks off the queue. We should investigate why this is happening and actually iron out what we would like the aws-provisioner to do to nodes which become "zombies".
Also seen on i-330e833c, i-cf0b86c0, i-0e0d8001 (this one did not even register a hostname on papertrail!), i-a2179aad, i-aa2ba6a5, i-e93db0e6, i-3856da37.

Clearly something is preventing upstart from even _starting_ the docker-worker/docker (it starts diamond at least in some of these cases).
Blocks: 1080265
It appears that the upstart job creates the logical volume, but the next step, formatting it, never happens. Not sure why yet. I didn't find anything in the logs indicating a larger failure.

Zombie instance:

root@ip-172-31-4-29:/home/ubuntu# lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME                          FSTYPE      SIZE MOUNTPOINT LABEL
xvda                                        8G            
└─xvda1                       ext4          8G /          cloudimg-rootfs
xvdb                          LVM2_member  75G            
└─instance_storage-all (dm-0)              75G       

Good instance:

NAME                          FSTYPE      SIZE MOUNTPOINT LABEL
xvda                                        8G            
└─xvda1                       ext4          8G /          cloudimg-rootfs
xvdb                          LVM2_member  75G            
└─instance_storage-all (dm-0) ext4         75G /mnt
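The difference between the two listings is the FSTYPE column of the instance_storage-all row: blank on the zombie, ext4 on the good instance. As a rough sketch of how that check could be automated, the helper below reads the column-aligned table that lsblk printed above. The function name `fstype_of` is hypothetical, and it assumes the text passed in starts at the header row (not the shell prompt line) and that FSTYPE values are left-aligned under their header, as they are in the output above.

```python
def fstype_of(lsblk_text, device):
    """Return the FSTYPE shown for `device`: "" if blank, None if not listed.

    Assumes `lsblk_text` is the table from
    `lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL`, header row first.
    """
    lines = [line for line in lsblk_text.splitlines() if line.strip()]
    if not lines:
        return None
    # Column where FSTYPE values begin (raises ValueError if the header lacks it).
    start = lines[0].index("FSTYPE")
    for line in lines[1:]:
        # Drop the tree-drawing prefix and any "(dm-0)" style suffix from the name.
        name_field = line[:start].strip().lstrip("└├─│ ")
        if not name_field or name_field.split()[0] != device:
            continue
        # A space (or nothing) at the column start means no filesystem was reported.
        if len(line) <= start or line[start].isspace():
            return ""
        return line[start:].split()[0]
    return None
```

With the zombie listing this would return an empty string for "instance_storage-all", and "ext4" with the good one, so an empty result is the signal that the format step did not run.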
Hrm I wonder if this is something we can retry somehow... It looks like the process was at least partially completed.
We should be able to capture the status code and retry a few times. I can add some logic around it, plus some useful logging, to help narrow down the conditions under which a failure happens and to at least tell us exactly what failed and was retried.
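A minimal sketch of that idea, not the change that was eventually made: run a command, capture its exit status and stderr, log each failure, and retry a bounded number of times. The function name, logger name, attempt count, and delay are all assumptions chosen for illustration.

```python
import logging
import subprocess
import time

log = logging.getLogger("instance-storage")  # hypothetical logger name


def run_with_retry(cmd, attempts=3, delay=1.0, run=subprocess.run, sleep=time.sleep):
    """Run `cmd` up to `attempts` times; return True as soon as it exits with 0.

    `run` and `sleep` are injectable so the logic can be exercised without
    touching real devices.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Record exactly what failed, with which status, on which attempt.
        log.warning(
            "%s failed (attempt %d/%d, exit status %d): %s",
            " ".join(cmd), attempt, attempts, result.returncode,
            (result.stderr or "").strip(),
        )
        if attempt < attempts:
            sleep(delay)
    return False
```

Bounding the attempts matters here: an unbounded retry would just move the hang from upstart into this helper, while a logged, final failure gives something to search for in papertrail.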
++ I think we need some kind of retry logic for this particular case. We may or may not have to destroy the LVM "disk" and start over... I also wonder whether the ext4 formatting itself is what's failing; if so, maybe there is some verbose logging we can try.
It seems to me we should figure out why it fails sometimes. This isn't a network thing; it's not supposed to only work "sometimes".

If we can reliably detect a failure to format the disk, we can just force a shutdown. Though this can create a spawn/shutdown loop... I hate those :)
We absolutely should figure out why it's failing.  Magical failing things are too mysterious for my liking :)
Assignee: nobody → garndt
Attached file GH PR 54
Over the course of a couple of days I have spawned over a thousand nodes using a graph of 170 tasks and haven't had a zombie yet. Perhaps it could be tested with the tasks that were running on the gaia worker type. I created this AMI with the changes: ami-25185715
Attachment #8506989 - Flags: review?(jlal)
More info about the error: the upstart job that creates the logical volume was respawned the first time because mkfs could not start; the system reported that the volume was currently in use. Every respawn after that failed at an earlier step, because the commands to create the logical volume fail when the LV is already present. That triggered further respawns, and mkfs was never attempted again.
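One way to avoid that trap is to make each step safe to repeat: only create the LV if it is missing, only format if no filesystem is present, and retry mkfs on its own rather than restarting the whole job. The sketch below illustrates that ordering; it is not the code from the PR. The volume group and LV names are inferred from the device-mapper name instance_storage-all shown by lsblk, and the exact lvcreate and mkfs arguments are assumptions, since the upstart job itself is not quoted in this bug.

```python
import time

VG, LV = "instance_storage", "all"  # inferred from "instance_storage-all"
DEVICE = f"/dev/mapper/{VG}-{LV}"


def prepare_storage(run, mkfs_attempts=3, pause=lambda: time.sleep(2)):
    """Create and format the LV so that re-running after a partial failure is safe.

    `run(cmd)` must execute the command list and return its exit status.
    """
    # Step 1: create the LV only if `lvs` cannot find it, so a respawn after a
    # failed mkfs does not die on "already exists".
    if run(["lvs", f"{VG}/{LV}"]) != 0:
        if run(["lvcreate", "-l", "100%FREE", "-n", LV, VG]) != 0:
            return False
    # Step 2: skip formatting if blkid already finds a filesystem signature.
    if run(["blkid", DEVICE]) == 0:
        return True
    # Step 3: retry mkfs by itself; "volume in use" may be transient.
    for attempt in range(mkfs_attempts):
        if run(["mkfs.ext4", DEVICE]) == 0:
            return True
        if attempt < mkfs_attempts - 1:
            pause()
    return False
```

In real use `run` could be something like `lambda cmd: subprocess.call(cmd)`. Passing it in keeps the ordering logic testable without root or LVM, which is useful given how hard the original failure was to reproduce.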
Comment on attachment 8506989 [details] [review]
GH PR 54

A few comments here... can you flag me again for r? after going through them?
Attachment #8506989 - Flags: review?(jlal)
Closed: 8 years ago
Resolution: --- → FIXED
Summary: docker-worker: Investigate zombie nodes → docker-worker: Formatting instance storage fails intermittently
Blocks: 1089251
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Component: Docker-Worker → Workers